Okay, good morning, good afternoon, people. Let's see — my universal question: can anyone hear me? Okay. Thank you.

Okay, so this is Parallel Computing, class 5 I guess, December 8, 2021. I say that so that if the video gets detached from its title, we know what it is. So, what's happening today: I've got some utility information about connecting to remote computers and file systems and so on, working with operating systems, and then we'll finish off OpenMP and talk about OpenACC. We'll be moving on after that into NVIDIA, and I've got a homework 3 as a chance for you to program with this.

So, connecting to other computers. When you're connecting to parallel, you can just type in your password, but ssh is actually a public-key system, and it's actually better to create key pairs. If you create a key pair, then you do not need to type passwords to connect in. For example, if I show you — I bring up a window here, make it bigger — if I want to connect to parallel, I just say ssh parallel. You see, I'm in. And you'll notice I did not have to type a password, because on parallel I created a public/private key pair and I copied the public key over from my local laptop. I'll let you read the manuals and so on, and if you're on a Windows host you'd use PuTTY or whatever. And, if you look at the top left corner up here, by default it will also forward X connections and so on. Okay, so let's close that now. Are there any questions?

So you can create key pairs, and I've got some information on that here. Now, another advantage this gives you is that you can mount remote file systems back on your local computer — you can access remote files. Let me demo that. I go to the Files application here, go down to Other Locations, and I say Connect to Server.
Maybe I want to do this — let's see, what am I doing here? How did I type it? sftp:// — let's say. Okay, so what I've now done here: this has now mapped parallel's file system over to my local laptop. And if I don't like using a browser, there's another way I can get at it. I'm on a Linux host here, so it's part of my local file system and I can use all my command-line tools. It's under /run/user — right there — so I can connect to it, and again there is parallel's file system as part of the namespace of my local computer, so I can go in here and just access it like a local file.

A couple of points: don't try to do really cute things — simultaneous reading and writing and so on; anything complicated and fancy might not work that well. It's a FUSE thing: Filesystem in Userspace. And you do not get more rights than you would normally have. If I go to the root of the file system here and say, touch foo, like that — you see, it's going to complain, because I don't get root rights. So this is a nice way to access the remote file system locally, use emacs on it, and so on.

So, coming back here, other things you can do: you can also use ssh to run commands on the remote machine. For example, let me go back to it — this is just ls. Well, here's another cool thing you get on a Linux file system: /dev/shm is a file system that lives in memory, in DRAM, core main memory, and its size is one half the physical amount of main memory. So this machine here has 128 gigabytes of memory, and this temporary file system is potentially 64 gigabytes, and since it's actually in DRAM it is really, really fast. So if you don't want to worry about latency and everything with your disk, put your files in /dev/shm — shm for shared memory — and you won't have any I/O time. That's just one little hint there.

In any case, other things you can do: you can run a single command on the remote machine, so this will run a single command on parallel — for example, if you just want to run one quick command. Another thing I can do is copy single files back and forth and so on. So I have a file foo here; okay, I want to copy that to, let's say, foo2 on parallel. And now this has just copied foo over to parallel, and we can look — there it is. I could put a sequence of commands after it, and even run something interactive if you wanted, to some extent.
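As a tiny illustration of that /dev/shm tip — my own sketch, with a made-up file name, not something from the course — a program can use /dev/shm as scratch space like any other directory, just a much faster one:

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // Hypothetical scratch file; /dev/shm lives in RAM, so reads and writes here
    // avoid disk latency entirely.  Anything stored here vanishes on reboot.
    const std::string path = "/dev/shm/scratch_demo.txt";

    std::ofstream out(path);
    for (int i = 0; i < 1000; ++i) out << i << '\n';
    out.close();

    std::ifstream in(path);
    int first;
    in >> first;
    std::cout << "first value read back: " << first << '\n';   // prints 0
}
```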
Okay, so what I'm showing you are tools to work with remote computers, to enlarge your tool set. You can copy files, copy whole directories. The cool thing is that it even does file-name completion on remote files, single commands, and so on. So that's one thing.

Another thing about file systems: on parallel — here's my parallel window — okay, if I go here. This file system has some features that might be useful for you. One of them is that it transparently compresses, so you have no need to run gzip and so on on your files, because you store them on this file system — and that's where your user directories are — and your files are automatically compressed. That just makes life easier.

Another thing it has — this is under the ZFS pool, you might call it — another thing it has, if it's working; let me see if it is... yeah. Okay, so it is doing automatic snapshots: every 15 minutes, then every hour — there are the hourly ones — then every day, then every week, then every month, and it deletes the old ones. So if you delete a file, it may be in one of these snapshots, and you can just go back and get it.

Now, the way ZFS is able to do this is that it has a copy-on-write philosophy — so this is getting into an operating-systems course, a file-systems course. So, nice things about ZFS: one, it compresses; two, there's the snapshot thing. These snapshots are atomic — it snapshots the whole file system at the same instant — but it just makes a note, so the snapshot's cost is basically free. If you overwrite a file, then it makes a new copy, but as long as a file was not changed it doesn't make a separate copy, so it's not using enormous quantities of disk space. Which is nice.

Now, just to give you an example, let me look at a snapshot — oh, I don't know, let me look at a frequent one. Okay, let me give you an example; let me just look at this. Okay, so this is what the snapshots look like — these names are in universal time. Okay, so let me look at the snapshot from a while back — oh, say, December...
And we don't have a snapshot from back then any more. So, if a file existed long enough to be captured in a snapshot, and long enough that that snapshot didn't get deleted — like I said, the frequent ones are only stored for a day, because you've got the hourly ones — then you can get it back. In any case, this is something I've used once or twice when I accidentally deleted something. Okay, so that's nice stuff with ZFS. Any questions?

That's another operating-system thing — I don't know if our operating-systems courses teach current file systems. ZFS also has some other nice features: you can clone a whole file system, and now you've got two versions of it, and again it doesn't duplicate the space until you start changing stuff. So you've got a tree-structured thing, two versions of the file system which you can now do whatever you want with. They are separate file systems, but if a file is the same in both clones, only one copy of it is stored.

Okay, another operating-system thing which is relevant is stacks. I think most of you are aware of this: you've got this push-down stack for local variables on your computer. You call a function — subroutine, whatever; the names are synonymous — and it puts a new stack frame on the stack, and local variables are allocated on the stack. Then when you return from the function, the stack is unwound and all the local variables are freed automatically. This is separate from the heap, which is a global thing where you explicitly allocate stuff and explicitly free it — malloc and free, or new with constructors and delete with destructors, and so on.

Now, with the stack, you might wonder what happens when you've got threads, like in OpenMP. The answer is that every thread has its own independent stack, created when the thread starts and destroyed when the thread is finished. By default they're very small, but you can make them bigger. So, if I come over here — let me make things big — if I do ulimit, the stack size is the -s option. Here I've made it quite big, actually, but by default — if I open a new tab, that's on parallel — it's very small, 8 megabytes. And if you run a program which is using the stack, and you try to put more local variables than that on the stack, the program will crash. However, you can increase it: that would be ulimit -s with something bigger. Let's say we do that.
Now the stack size is a reasonable size. So if you're going to be running programs that use the stack, you want to make the stack size bigger.

Now, you might be worrying: this is parallel, with 56 hyper-threads, and if you give each hyper-thread a few gigabytes of stack, you're really wasting a lot of memory. Well, no, you're not, because on Linux a page of virtual memory is initially just an entry in a table, and the physical memory is not actually allocated until you touch it. So if you make a humongous stack, it doesn't matter until you actually touch it; there's no problem with having big stacks, because of the way the virtual-memory manager works. This also solves a problem they do in some operating-systems courses: if you have one stack in your program, it grows up from the bottom; if you have a second stack, it grows down from the top of your available memory; and what if you want more stacks? On a paged memory-management system, it doesn't matter.

There's also a program which will show this... I have not updated it, okay. I've got it with the OpenMP material, I think — oh, okay, good, it's the stack-size program; I'll copy it over. What the program does — it's also using a couple of nice things here; let me show you locales. So, in C++, the locale for a program sets characteristics such as how you print numbers. Here we separate every three digits with a comma; in Europe they might separate every three digits with a period and mark the decimal point with a comma. It says how you print your numbers, and so on. What I've done here is set this locale, so that when I print big numbers it will put the commas in after every three digits. That's fun — I just set the locale.

Now, for the stack size: you can get resource limits with getrlimit, and it feeds into a structure, rlim, whose type is struct rlimit, up here. What we have here: rlim has these fields, such as rlim_cur, which is the current limit on the resource, and rlim_max, which is the maximum. So here I go in and just double it — getrlimit gets the size, setrlimit sets the resource — and it sets it in here, and then I go down here...
...and I try to set it, and then I print the new value down here. So, see — if I run the program, you see initially the current limit was 1 gigabyte, and the maximum was more than you could possibly want to use — you see the advantage of putting a comma every three digits — and after doubling, it went from 1 gigabyte up to 2. So it really is reading the current value.

I could also make it really small, something like that. Now I run the stack-size program: you see, the 100 here was in 1 K units, so it was initially 100 KB, and now I've made it bigger. If I run the program, now it doubled, and so on. Let me look at the program one more time — oh, what happened here is I made the stack size so small the program won't even run, so maybe I'd better make it a little bigger. Good. And at the end here there's a test: if things fail and I try to access this local variable on the stack, then I'll get a segfault. So the segfault is the message that your local stack was too small.

Okay, so these are programming tools that will help you, and this concept of a large local stack for each thread — I think it's a useful programming tool that is under-used. A workman needs his toolkit, his toolbox, and these are tools for the toolbox of the Linux programmer: multiple stacks, large stacks, a local file system in DRAM. Very powerful tools that are under-used. Okay, if you have favorite tools yourself, mention them to me.
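Here is a condensed sketch along the lines of that stack-size demo — my reconstruction for reference, not the actual course file; the doubling and the messages are just illustrative, and it assumes the soft limit is finite:

```cpp
#include <iostream>
#include <locale>
#include <sys/resource.h>   // getrlimit, setrlimit, RLIMIT_STACK

int main() {
    // Locale trick: print big numbers with thousands separators (per the system locale).
    std::cout.imbue(std::locale(""));

    rlimit rlim;                                   // fields: rlim_cur (current), rlim_max (maximum)
    getrlimit(RLIMIT_STACK, &rlim);
    std::cout << "current stack limit: " << rlim.rlim_cur
              << "  maximum: " << rlim.rlim_max << '\n';

    rlim.rlim_cur *= 2;                            // try to double the soft limit
    if (setrlimit(RLIMIT_STACK, &rlim) != 0)       // fails if it would exceed rlim_max
        std::cout << "setrlimit failed\n";

    getrlimit(RLIMIT_STACK, &rlim);
    std::cout << "new stack limit: " << rlim.rlim_cur << '\n';
}
```

Plain g++ compiles this; no special flags are needed.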
Just a sec — I'm ignoring the phone, because the chat is there when it's available for people to talk.

Okay, now to OpenMP specifically. OpenMP — just to remind you, we have the website here, with lots of information, a lot of it free. Stuff is being added gradually; at the moment OpenMP is weak at handling a GPU back end, so I'm giving the demonstrations here on the multi-core side. Also, some of the best documentation is obsolete. So I'm showing you various examples — Lawrence Livermore National Labs has some material, also more information; I'm not going to cover all of it, I'll skip parts, and it's available if you want to read it. Well, let me go through and look at some of these. So, there are a lot of directives here, for things like defining how data gets copied into the parallel threads — what data is shared, what data is private, what gets copied in. Reduction is important; I'll hit reduction in a minute.

And atomic — okay, atomic directives, again, mark a section of code which will be done by only one thread at a time, so you don't get these problems with two threads trying to write the same data at the same time. The point is to force things to get serialized. Barriers have the obvious meaning. An atomic serializes the next simple instruction; a critical serializes an arbitrarily big block of code, but the overhead to start a critical block is much larger. The other directives are fairly obvious.

Okay, so I'm mentioning the problem here about serialization — we talked about it last time. I'm spending more time on this because this is the curse of parallel computing: if you've got two threads doing a load and a store, they could go in any order, a different order every time, and give a different answer every time — or, of course, they may go in the same order every time and give you a consistently wrong answer. So, I mentioned critical and atomic; I mentioned this sort of thing before.

Now, how do you compile your programs? Well, if I go back to here — I have things in the makefile; I assume everyone's aware of make — I say use the -fopenmp flag. You have to add a flag to use OpenMP, and the name of the flag depends on the compiler, so, grumble. Okay, and I mentioned before, I believe, the real-number properties.
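As a quick illustration of that serialization point — my own minimal sketch, not one of the course examples — here is the classic shared-counter race and the two OpenMP fixes just mentioned:

```cpp
#include <cstdio>

int main() {
    long total = 0;
    #pragma omp parallel for
    for (int i = 0; i < 1000000; ++i) {
        // total += 1;              // RACE: loads and stores interleave, total comes out wrong

        #pragma omp atomic          // serializes just this one simple update -- cheap
        total += 1;

        // #pragma omp critical     // serializes an arbitrary block -- correct, but higher overhead
        // { total += 1; }
    }
    std::printf("total = %ld (expect 1000000)\n", total);
}
```

Compile it with g++ -fopenmp (the flag name differs on other compilers, as noted above).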
Okay, tasks — I showed you these quickly last time. I love this; it's a beautiful thing — this is exactly how you do Fibonacci. Let me just review tasks, and I'll show you some different task examples.

The concept here is that you can explicitly create parallel tasks, and this directive starts one. What it does, more technically, is put the task on a queue, because you may have more tasks than you have threads. I've got 56 hyper-threads here, but you're not restricted to 56 tasks in your program — you can have as many as you want, and there's an argument for having more than 56, actually, because then there will always be some queued up ready to run if all the current ones are blocked, say waiting on I/O or something.

So, in any case, you can create explicit tasks, and this creates an explicit task. It's doing Fibonacci; the tasks go into the queue and run in parallel, as many at a time as the computer can run. My laptop is dual 6-core, so it could run the same program as parallel, but parallel is dual 14-core and my laptop is only dual 6-core, so there would be more tasks waiting in the queue on my laptop. In any case, this recursively starts two parallel tasks, each computing a Fibonacci number recursively; they run and then return. Back in the caller we're going to wait — and we also have an atomic here: we want to total up the number of tasks, so that's done atomically; incrementing a variable is one of the legal operations for an atomic. And then at the end we have a taskwait, which waits for the tasks that were fired up here. So inside the else we've got a task, and a task, and a taskwait — wait until those two tasks have finished — and those tasks are also firing up other tasks recursively. So you can do this, but you want to be reasonable about it, because there is an overhead with all of them.
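For reference, a minimal sketch of that pattern — my paraphrase of the idea rather than the course's Fibonacci file — with task, taskwait, and an atomic task counter:

```cpp
#include <cstdio>

long ntasks = 0;                       // how many tasks we created

long fib(int n) {
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)         // child task computes fib(n-1)
    a = fib(n - 1);
    #pragma omp task shared(b)         // child task computes fib(n-2)
    b = fib(n - 2);
    #pragma omp atomic                 // count the two tasks we just queued
    ntasks += 2;
    #pragma omp taskwait               // wait for both children before summing
    return a + b;
}

int main() {
    long result;
    #pragma omp parallel               // create the thread team
    #pragma omp single                 // only one thread starts the recursion
    result = fib(25);
    std::printf("fib(25) = %ld using %ld tasks\n", result, ntasks);
}
```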
Let me show you another task program. A couple of things in this program I want to show you first. I've got some really cool macros up here. What these macros do: the print macro, given an argument, will print the argument as a string, and then it will evaluate it and print the value, with a comma and a newline. So if I look down here — this will print, literally, the text of the get-threads call, and then evaluate it and print the result. I wrote this to help me debug programs; I think it's a great concept. And what we also have here are some control sequences for the terminal that cause it to change color.

Okay, so what's happening here? I'm not doing it recursively this time. I'm starting a lot of parallel threads here, and what this will do is run the contents of that block in parallel on every available thread — or however many threads the program is configured for. So this will run that block 56 times in parallel, which might not be what you want, but that's what it will do. So this block here will run on each of the 56 threads, but the thread number will be different for each thread. Then the critical says: do this on only one thread at a time. The barrier waits until all the threads are done — so I was wrong, this parallel block extends down as far as here, my mistake — and the barrier makes sure everything is finished. Then master says: run something on only the master thread. And then we run a pile of tasks in parallel.

Okay, so what we get in the output up here is a mess, because this part was not serialized. What we got is everything saying "starting parallel" — the max number of threads was 56, so "starting parallel" gets written 56 times — and then these are the thread numbers and so on; a big mess, which shows things have to be serialized. And getting the thread number — this is the concept: you see it prints the expression, in red, and then evaluates it and prints the value. Okay, so that's another example: we start a lot of tasks, they run in parallel, and then they all finish. There are other ways to do things in parallel.

Let me show you some other stuff here. This next one just shows examples of the various things that you can read in OpenMP: the number of threads, whether you're in a parallel block at the time, and so on, and the wall-clock time — it just shows examples of getting a lot of them. Oops, let me come back here — in the main program, I'm doing the block in parallel, but what this says here is do it single: only do this in one of the 56 threads, don't repeat this block for every possible thread. The reason I'm doing parallel and then single is so that we can get the number of threads and so on, and get all the various values here.

Okay, and you can use omp_get_wtime to get the elapsed time for your program. The thing is, Linux has a high-resolution and a low-resolution clock; the standard clock ticks something like a sixtieth of a second, which isn't actually fine enough, so the high-resolution clock here is better. Okay, and by the way, you can set the number of threads, as I showed you last time, with that environment variable — that was here.

"Hey, Professor?" — Yes? — "Quick question: the delta wall-time function in the common file, how accurate is it?" — Delta wall time, I think it's called, my clock function? I hope it's good; I don't guarantee it. If you're going to time something for publication, you want to run it a couple of times: the first time you run a program it may take more time, and the second time stuff will be in the cache and it will be faster. So you run it a couple of times, and make sure nothing else is running on the system.
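Since the question came up, here is a small sketch — my own illustration, not the course's common file — of timing a region with omp_get_wtime, the high-resolution wall clock mentioned above:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>
#include <cmath>

int main() {
    std::vector<double> v(10'000'000);

    double t0 = omp_get_wtime();            // wall-clock seconds, high resolution
    #pragma omp parallel for
    for (long i = 0; i < (long)v.size(); ++i)
        v[i] = std::sqrt((double)i);        // some work to time
    double t1 = omp_get_wtime();

    std::printf("elapsed %.6f s, clock resolution %g s\n",
                t1 - t0, omp_get_wtick());  // omp_get_wtick() reports the tick size
}
```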
Okay, so: single and barrier. What's happening here is we're just putting barriers around things, so this executes only once, and inside the parallel block the number of threads will be 56. And the critical means the "Hello world" will get printed without being all scrambled, because the whole line gets executed by one thread at a time. Let's try this. Yeah — all the Hello-world lines were not scrambled up, except for that last line, which got scrambled because I needed one more barrier here or something; and every time I run it, of course, the thread numbers are different. Okay, so I think I need one more barrier at the end here so things would not get scrambled — oh, it's okay, it's not scrambled — well, except that last bit. Okay.

Another way to do things in parallel is that you can explicitly parallelize pieces of your program. What's happening here: here's a delta-clock time, and here I'm setting a locale so it prints with commas. Delta clock is a routine I wrote; what it does is print the elapsed time since the last time I called it. Okay, now, what's happening here is that I have explicit parallel sections. They're not in a for loop and they're not recursive tasks or anything; they're just two parts of my program that I say don't depend on each other, so I can do them in parallel. So I've got two sections here: this section is a for loop that creates c, and this section here is another; those two will run in parallel. It's the programmer's job to ensure they don't step on each other's toes. The way I do that: each one is a section, singular, and I put all of the sections inside a sections, plural, pragma. So now it's like a case statement or something: I've got each of these — as many as I want — and they all run simultaneously to the extent possible; if not, they get thrown on the queue to run when threads are available. And I have to do all of that inside a parallel block, so I create the parallel environment, which extends as far as here, and inside the parallel environment everything gets done multiple times on every thread — except if it says master, it's done only on the master thread — and then inside the sections construct, the sections are each farmed out to separate threads.

And if we look at this, the CPU load was 165%, so on average I was using more than one thread to do it. Very useful.
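A minimal sketch of that structure — my own example, with made-up array names — parallel, then sections, then one section per independent piece of work:

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 10'000'000;
    std::vector<double> b(n), c(n);

    #pragma omp parallel            // create the team of threads
    {
        #pragma omp sections        // the sections below are farmed out to different threads
        {
            #pragma omp section     // first independent piece of work
            for (int i = 0; i < n; ++i) b[i] = 2.0 * i;

            #pragma omp section     // second independent piece, no dependence on the first
            for (int i = 0; i < n; ++i) c[i] = 0.5 * i * i;
        }                           // implicit barrier: both sections are done here
    }
    std::printf("b[7]=%g c[7]=%g\n", b[7], c[7]);
}
```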
Now, the reason this time report gets printed out is that I've got an environment variable set, so any command that takes more than a few seconds will automatically have time run on it — and there's the locale putting the commas in; I said I like that. The way I get the time to print out automatically is REPORTTIME: this causes a report if the command takes more than a second, and then it prints the time the command took, up here.

Okay, sections. So, to do a section: you set up a parallel block, you define all your sections inside a sections construct, and you define the individual sections one by one. So, IMNSHO — my abbreviation for "in my not so humble opinion" — OpenMP is easier than pthreads and fork.

Okay, so that's about as much as I'm going to show you of OpenMP; Stack Overflow has questions on it. With a task or a section you're being explicit. With the section construct, the sections run at the same time, more or less synchronously, to the extent possible; tasks are totally asynchronous — you fire off a task and return to the caller, and the task just sits there and runs when it can; you can wait for it to finish, which might be a good idea. The sections directive typically waits until they all finish. So, okay, that's OpenMP. It's a step above pthreads, but you're still very prescriptive about what you're doing. Now, a problem with OpenMP is that it's still weak on GPUs — they've only lately been adding GPU support — so to parallelize on a GPU I'd recommend other tools, like the next one I'm going to talk about. But I wanted to introduce you to OpenMP because it is a major parallelization tool; it's been around — it started, I'd say, 20 years ago — so it's mature, with a wide base of users, and you can put on your resume that you have written an OpenMP program. And I've got various other things here, and other stuff from last year.

Oh, one more thing I forgot to show you, my mistake: reduce. Let me show you the sum problem again — okay, you remember this. You've got your parallel for, and it's got this sum, and the computed total is going to be wrong because the different threads step on each other. Now, you could put this update in a critical — that's very slow — or you could put it in an atomic, which is faster. But for things like this, where you're summing into a total, there's what's called a reduction operation: we're reducing a vector of arguments to a total, or something else.
Okay, there's a special construct to do this, because it's a common thing people want to do, and there's a construct that does it much faster: the reduction clause. Let me show you that. Notice that inside the for loop we do not have an atomic or a critical or anything. What we have, on the starting pragma — pragma omp parallel for — is this new clause. What it says is that the variable computed, that's down here, is the output of a reduction, and the reduction operator is plus. So this tells OpenMP that inside the for loop, computed is going to be a sum of a lot of local values, and to do it fast. What OpenMP will do is keep a separate local version of computed for each thread, so each thread will not sum into the global computed — it will sum into a local subtotal variable — and at the end, all the local subtotals are summed into the global one. So it's very efficient: you don't need any locks or atomics or criticals at all, so it's going to be fast and it's going to be correct. And we get the correct answer.

Now, looking at this again: this was doing a reduce with a sum, and there are a number of other operators you can use. The basic requirement, if this is to work, is that the operator has to be commutative and associative. So you can reduce with a product, a max or a min, a logical or bitwise or and and — but you could not reduce with, for example, a minus, because subtraction is not commutative. There's only a specific list of operators you can reduce with; they're in the documentation.

Let me show you some others. What are we doing here? We're looking at numbers of threads, we're looking at times and so on; here we do a reduce and print all sorts of stuff. And this one is an attempt to do it on the GPU — an attempt to have it compile for the GPU, with target and teams and whatever — and the concept is that it's a little faster, perhaps. We could also sum a series; that's interesting.
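Here is a minimal sketch of the reduction clause as just described — my example; the variable names are made up:

```cpp
#include <cstdio>

int main() {
    const long n = 100000000;
    double computed = 0.0;

    // Each thread accumulates into its own private copy of 'computed';
    // OpenMP combines the per-thread subtotals at the end.  No atomics needed.
    #pragma omp parallel for reduction(+:computed)
    for (long i = 1; i <= n; ++i)
        computed += 1.0 / (i * 1.0 * i);        // partial sums converge toward pi^2/6

    std::printf("sum = %.10f\n", computed);
}
```

The same clause form also accepts *, max, min, and the logical and bitwise operators, per the list above.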
Other things to show you: let me show you another working program, where you can get sizes. This shows how you can get the sizes of different data types: sizeof — you give it a data type as an argument, and it returns the size in bytes. This can be useful because nothing in the C++ standard defines the absolute size of anything; it only defines relative sizes — a short can be no longer than an int, and an int can be no longer than a long, and so on, but they could be the same size. So this is a way to get the sizes of different data types, which can be useful. In fact, an int here is 4 bytes, and long does the common-sense thing, but notice that long long is no longer than long — they're both 8 bytes. So this sort of thing is useful.

Other useful things to show you: things we can do with matrices. This is just me playing around — I never did get it working quite right — matrix multiplication; some playing with doing matrices in parallel. This is just the sequential version here; we can play with it, we can copy it in. What's happening here is we're trying to do the thing in parallel and see what happens. And, yeah — previously it took almost 4 seconds of real time, and now it took a third of a second real, so apparently it went a lot faster. And by the way, I've turned off optimization in the compile here, just so as not to confuse things; if you turned on optimization, everything would go very much faster. What are we doing here? Nothing too interesting.
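A bare-bones version of that experiment — my sketch, not the course file — parallelizing the classic triple loop over the rows of the result; compile with -fopenmp as before:

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 800;                           // small enough to finish quickly even unoptimized
    std::vector<double> a(n * n, 1.0), b(n * n, 2.0), c(n * n, 0.0);

    // Each row of c is independent, so the outer loop parallelizes cleanly.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)              // k before j keeps the inner accesses contiguous
            for (int j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];

    std::printf("c[0] = %g (expect %g)\n", c[0], 2.0 * n);
}
```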
Okay, so that's a good executive introduction to OpenMP.

OpenACC — well, first I want to talk about compilers, lots and lots of compilers; and that's not even all of them, there's also Intel. The examples I've been giving you here have been using g++; it's a very nice compiler and does OpenMP well. Now, we're migrating toward using CUDA in a few days — CUDA is NVIDIA's low-level language, some extensions to C++ — and for it NVIDIA has a compiler called nvcc; you need something like nvcc to compile it, because g++ doesn't know CUDA. Now, at times I get annoyed with g++: it takes a while to adapt to new hardware. There's a commercial compiler called pgc++ — it's commercial, but it's free for, like, amateur usage — and I think it may be better than g++, especially for compiling to NVIDIA and so on. We have pgc++ there on parallel. So that might almost be better than g++; the thing is, the names of the flags and so on will be different. In any case, I may start switching over to it for, say, OpenMP. Now, NVIDIA and pgc++ — I believe PGI may have been partly sponsored by NVIDIA, and my interpretation is that NVIDIA took over pgc++ and rebadged it as nvc++, NVIDIA C++, and they've added some new features, so I'm thinking that is possibly the best one here. I have not yet installed it on parallel. But in any case, for OpenMP, g++ works fine; you could use the others, but some of the flags are different. pgc++ has some nice debugging features I may show you. And I have a homework 3 here, playing with OpenMP.

Okay, OpenACC. This is a newer thing than OpenMP, and it's more abstract. With OpenMP you're very specific about what you do: you say, parallelize this for loop, run these sections in parallel, or fire off tasks to go into the queue. OpenMP is very specific about the parallelization, but it hides some of the low-level bookkeeping that you have to worry about with pthreads. So you might almost say OpenMP has the same power as pthreads but is easier; OpenMP may give you a little more. And the hardest part of any of this is that your algorithm has to be parallelizable — that's the hard part. Now, OpenACC is higher level and newer than OpenMP, and I think it's useful by now — I have a rule that I don't like using something until it's 10 years old, but OpenACC is useful. Again, it has wide industry support, so it's a living system: people use it, it gets extended, and that's nice. I like living systems that are widely used, not just toy systems — that's my insulting term. And compared to OpenMP, OpenACC is higher level, and OpenACC also works with GPU devices. OpenMP has been really late adding GPU access, and it gets done badly and non-standard — the thing is, once the standard does it, then of course the compilers have to do better than that. So those are reasons for OpenACC. It has wide support, and there's a lot of information here. What I want to do is walk you through some of the tutorials, and maybe next time I'll run a few programs. Okay, so you can watch the recording; I'm going to walk you through the slides, and this will also give you an introduction.
There are piles of information available, and again it's also supported by NVIDIA. Okay — oops, just a second, I don't want to do that — okay, so I'm going to walk you through this and just hit the highlights. I notice NVIDIA supports this; they talk about taking a whole week for it, and I'm going to do it in 20 minutes or whatever.

Okay. So, OpenACC is compiler directives, like OpenMP's compiler directives, plus some library stuff; I'll show you — you add pragmas instead of rewriting the code, to parallelize it. There are different types of pragmas here. These sorts of pragmas talk about the data: do you want to copy the data into the parallel region at the start, copy it out at the end, or both? The compiler can actually determine that much of the time, but perhaps you know better than the compiler, and if you get explicit about the data movement, the program might compile better. So if your program is simple, let the compiler figure it out; otherwise, specify it. Then you set up a parallel region. Okay, this one says compile it for the device and throw everything you've got at it, basically. So there's a loop coming up, and gang means run it on the CUDA cores on the GPU. Okay, and this one is similar to OpenMP.

Many-core: this refers to things like the Intel Xeon Phi co-processor card that I talked about for a few years in this course, and stopped, because Intel dropped the product a couple of years ago. The Xeon Phi was a co-processor card that plugged into your machine and had about 60 cores on it, each running several threads — so it's called many-core, 60 cores on the card. They were stripped-down Xeons — they stripped out a lot of things like speculative execution and so on, so they took less hardware to build; but if your code did not require things like speculative execution, it ran very fast. The card ran a stripped-down embedded version of Linux, and you could connect to it with ssh and so on and use shared file systems. But that's sort of obsolete now. The regular Xeon is what we call multi-core — parallel is dual 14-core, and there are dual 20-core ones — whereas on the Phi each core was very small.
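To make those data-movement clauses concrete, here is a minimal sketch — my example, not taken from the tutorial — of telling the compiler explicitly what to copy in and what to copy back out around a kernels region:

```cpp
#include <cstdio>

int main() {
    const int n = 1000000;
    static float x[n], y[n];
    for (int i = 0; i < n; ++i) { x[i] = i; y[i] = 1.0f; }

    // copyin: host -> device at entry only; copy: in at entry and back out at exit.
    // Without these clauses the compiler guesses the data movement itself.
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc kernels          // let the compiler parallelize what's inside
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];
    }
    std::printf("y[10] = %g\n", y[10]);   // 2*10 + 1 = 21
}
```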
Okay, the concept is that you write your code — your algorithm has to be parallelizable; you'll hear that from me again and again — and you add annotations, and the compiler then determines how to parallelize it for the machine. You can still run the program on your sequential machine, and the theory is that the compiler will compile for different architectures. The concept is that you don't have to worry about some of the low-level details; you're perhaps not going to get the same total power as if you were aware of the low-level details, but your development time is faster. So this is basically the same idea as OpenMP. And you can just say, parallelize the loop — it's your responsibility that the separate iterations of the loop don't affect each other.

Lots of target devices. IBM POWER is a very nice architecture, actually, used in a number of the Top500 supercomputers. IBM, in fact — I ask myself, what does IBM do well today? And I can think of only two things. There are their cloud-computing services — they're number 4 or 5, with a couple percent of the market; well, they have their mainframes, which are nice but becoming a little obsolete; so their cloud computing, I think, is almost pointless. They just, a week or two ago, closed off their blockchain group, so they've decided blockchain is not a money maker and isn't going anywhere. But their POWER architecture is very nice, and they plug NVIDIA cards into it, and that makes a very nice supercomputer — some of the top supercomputers are doing this, and the POWER machines have a very fast bus to the NVIDIA cards, faster than you can get using a Xeon, actually. So IBM does that very well; that's one thing IBM has — the components for supercomputers. The other thing I think IBM does very well is quantum computing; they're perhaps a leader there. Nothing else.

Okay, back to OpenACC. So you've got the CPU — this is sort of showing multi-core — and then the GPU; and since NVIDIA has the biggest part of the GPU market, everything I talk about will be NVIDIA. In five years it might be something else. So, lots of CUDA cores; the programmer hands this to the compiler, and the same source can also run on the CUDA cores, perhaps. That's the same slide; they're trying to hammer in some important things. And that's nice advertising — well, good, nothing wrong with advertising; lots of slides to tell you OpenACC is wonderful. Please stick on your resume that you've used OpenACC, that you've programmed in OpenACC. Syntax: pragmas and so on — nothing new there. You can also do it in Fortran, if you're unfortunate enough to have to use Fortran.
Okay, so now we're going to use an example, the classic heat-transfer problem: we just iterate so that every node becomes the average of its four neighbors, and do it in parallel. So we're iterating — every node is set to the average of its neighbors — and we keep doing it until it converges. We're ignoring things like over-relaxation and so on; this is just an example. Here's the program: they compute the average of the four neighbors, we compute how much it changed so we know how it's converging, and the next loop copies the result back. Okay, a sequential program, and we repeat this either until the error gets small or until we've iterated too many times.

Okay, they're making a big point here — I've got to show you some analysis and profiling tools — because it may not be obvious what's taking the time; what you think is taking the time may not be what is actually taking the time. So, profiling — and it turns out that the swap is taking almost half the time. I/O is always slow. And there are these tools — the examples use the PGI compilers and profiling tools; I'll run them for you on Thursday, I think; this is the introduction. You can profile sequential code and it will show you, for the different parts of the program, what is taking the time — and 46% of the time was copying the array over. Okay, and if we go down another level, eventually you see the low-level routines that are taking the time. Okay, nice things, the profiling tools.

Let me just, for the moment, scroll back to remind you what the program looked like, and then we'll come back to this page 34 — just a second here. Gotcha. Okay, so this thing here, just the copy, was like 46% of the time, and this thing here, computing the new value, was like 54% of the time. In other words, the copying was slow and the computation was comparatively fast — and that's often the case in parallel computing: the cost is dominated by the I/O and data-movement time. Okay, back to page 34 — if I'm scrolling back and forth so quickly that it's a problem, let me know.

Okay, parallelizing it. These gangs are just groups of threads on the GPU — if you're ahead of me in GPU knowledge, they're tied into things like thread blocks. In any case, this says we've got these gangs of threads here; I'm anticipating a little.
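For reference, a stripped-down sketch of the parallelized step — my paraphrase of the tutorial's Jacobi-style loop; the array names and sizes are illustrative — with a max reduction for the convergence error:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const int n = 512, m = 512;
    static float A[n][m] = {}, Anew[n][m] = {};
    for (int j = 0; j < m; ++j) A[0][j] = 100.0f;      // hot boundary row

    float err = 1.0f;
    int iter = 0;
    while (err > 0.01f && iter < 1000) {
        err = 0.0f;
        // Each interior point becomes the average of its four neighbors.
        #pragma acc parallel loop reduction(max:err)
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < m - 1; ++j) {
                Anew[i][j] = 0.25f * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
                err = std::fmax(err, std::fabs(Anew[i][j] - A[i][j]));
            }
        // Copy back -- the step the profiler showed eating roughly half the time.
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < m - 1; ++j)
                A[i][j] = Anew[i][j];
        ++iter;
    }
    std::printf("iterations=%d  final error=%g\n", iter, err);
}
```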
554 01:10:27.720 --> 01:10:35.310 have about 3 levels of hierarchy, and I'll write this on a future slide. At the very lowest level, 555 01:10:35.310 --> 01:10:39.899 you've got 32 threads forming a warp. 556 01:10:39.899 --> 01:10:53.250 And the 32 threads in a warp are executing the same instruction. So an instruction decoder decodes an instruction and distributes it, and all 32 threads are running the same instruction 557 01:10:53.250 --> 01:11:06.715 on different data, I'm sorry, not different code. The only difference is that a thread can be disabled. So a thread has an enabled status bit, and if the thread is disabled, it's not running the instruction; it's idle. 558 01:11:06.925 --> 01:11:10.375 But if the thread is enabled, all 32 threads in the warp are running the same, 559 01:11:12.210 --> 01:11:15.539 the same instruction. So that's a warp. 560 01:11:15.539 --> 01:11:24.930 Now, the warps of threads are grouped into what's called a thread block. So a block might have 1024 threads in it, 561 01:11:24.930 --> 01:11:28.350 32 warps of 32 threads. 562 01:11:28.350 --> 01:11:34.140 And the warps in a block, 563 01:11:34.140 --> 01:11:48.984 they're scheduled independently. There's a little operating system sitting on the GPU, and for the warps in the block there's a queue of warps waiting to run; there are queues everywhere. And so the warps can execute independently. 564 01:11:49.260 --> 01:12:00.930 But they still have connections to each other, in that all the threads in the block have a shared memory, a block of shared memory that they can all read and write. 565 01:12:00.930 --> 01:12:04.050 So, the warps in a block have it; 566 01:12:04.050 --> 01:12:13.560 they share some memory if they want to, they're not forced to. I mean, a thread has private memory local to the thread, but there is a shared memory that's 567 01:12:13.560 --> 01:12:17.970 shared by all the threads in the block. And 568 01:12:19.289 --> 01:12:33.689 also the threads in a block can synchronize: they can set up a barrier and wait till all of the threads in the block hit that barrier. So we've got the threads in a warp and then the warps in a block. That's 2 levels. 569 01:12:33.689 --> 01:12:37.739 3rd level, you've got separate blocks. 570 01:12:37.739 --> 01:12:42.750 So, you're going to have multiple blocks in your program, 571 01:12:42.750 --> 01:12:52.590 as many as you want, basically, of 1000 threads each, and the separate blocks are scheduled separately, and there's a queue of blocks. 572 01:12:52.590 --> 01:13:00.239 And they scarcely communicate with each other. There's global memory that they can read and write to; like, 573 01:13:00.239 --> 01:13:11.760 on parallel, that card has 16 gigabytes of global memory. All of the blocks have access to the global memory, but basically the separate blocks 574 01:13:11.760 --> 01:13:22.350 don't interact with each other. They could synchronize with each other, but that's probably a bad idea; it's going to really slow things down. So we've got the thread warps, 575 01:13:22.350 --> 01:13:28.229 the thread block, a single block, and then the multiple blocks. That's 3 levels. 576 01:13:28.229 --> 01:13:33.300 The multiple blocks form a kernel. A kernel's like a parallel program. 577 01:13:33.300 --> 01:13:37.409 Your GPU can run multiple kernels, so that's 4 levels. 578 01:13:37.409 --> 01:13:48.899 And the separate kernels don't interact with each other; while they could read and write to the same global memory, they're probably not even doing that.
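As a sketch of how OpenACC's levels relate to this hierarchy (the mapping is ultimately the compiler's decision; roughly, a gang corresponds to a thread block and the vector lanes to the threads/warps inside one), the clauses below show the idea. The loop, the names, and the choice of vector_length(128) are illustrative, not required values.

    // SAXPY-style loop spread over gangs (thread blocks) and vector lanes (threads).
    void saxpy(long n, float a, const float *x, float *y) {
        #pragma acc parallel loop gang vector vector_length(128) copyin(x[0:n]) copy(y[0:n])
        for (long i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }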
The separate kernels are like separate 579 01:13:48.899 --> 01:13:58.770 jobs on a parallel computer. Well, it is a parallel computer. And there's a queue of kernels. So 4 levels, 580 01:13:58.770 --> 01:14:03.779 a hierarchy of 4 levels on the GPU. 581 01:14:05.010 --> 01:14:11.250 And I could probably extend it to a 5th level if I thought about it. Okay. But basically, 4 levels of 582 01:14:11.250 --> 01:14:21.119 parallelism with threads. Okay, so here we're in OpenACC: a gang of threads in a block and then separate blocks. Basically, that's not a hard thing. 583 01:14:21.119 --> 01:14:25.560 Okay. 584 01:14:25.560 --> 01:14:34.710 And so, nothing interesting there. Do the loop in parallel, do iterations of the loop. You saw this before; I'm going to go through this fast. 585 01:14:34.710 --> 01:14:38.760 Oh. 586 01:14:38.760 --> 01:14:43.199 Like in OpenMP, nothing new there. 587 01:14:45.270 --> 01:14:51.090 And watch your data dependencies; that's your problem. 588 01:14:54.149 --> 01:14:58.500 It's too hard for the compiler. Nothing new there. 589 01:14:58.500 --> 01:15:08.100 Oh, the parallel directive? Yeah, that says, do everything inside the block here on every thread in parallel, unless 590 01:15:08.100 --> 01:15:12.060 there's something like a loop. Nothing new there. 591 01:15:14.545 --> 01:15:28.944 Okay, this is still like OpenMP, but here we have a reduction. In my example before, the reduction operator was plus; here the reduction operator is maximum. Maximum is associative and commutative, so 592 01:15:29.364 --> 01:15:30.444 that's okay. 593 01:15:30.720 --> 01:15:40.079 So here, this max is pulled out by the compiler, and it does a max separately on each thread and then combines all the sub-maxes into a global max. 594 01:15:40.079 --> 01:15:43.199 And then we parallelize the 2nd thing. 595 01:15:44.640 --> 01:15:48.989 And it mentions the reduction clause here. 596 01:15:48.989 --> 01:15:52.710 I told you what it does; the syntax is right there. 597 01:15:53.970 --> 01:16:03.989 There's only a fixed set of legal reduction operators, because they have compiler support. They're actually supported at a low level: plus, max, and so on. 598 01:16:03.989 --> 01:16:11.130 Likewise, to run the code: pgc++. 599 01:16:11.130 --> 01:16:20.279 And pgc++ has an enormous number of flags, including an enormous number of optimization flags. 600 01:16:20.279 --> 01:16:26.250 And -fast does a nice set of optimization flags. 601 01:16:26.250 --> 01:16:31.199 -Minfo is a really nice flag. It prints incredible amounts of debugging information. 602 01:16:31.199 --> 01:16:38.819 And -Minfo=all; I'm just introducing stuff, I'll demo it. 603 01:16:38.819 --> 01:16:45.149 And we've got various flags to say what to compile it for. I'll review this on Thursday. 604 01:16:45.149 --> 01:16:52.140 This says Tesla; Tesla means NVIDIA GPU, for historical reasons. It doesn't mean one specific generation. 605 01:16:52.140 --> 01:16:58.289 NVIDIA has generations of their GPUs, Kepler or 606 01:16:58.289 --> 01:17:09.300 Volta, and the current one is Ampere. The previous one is Volta, the previous one is Pascal, the previous one is Maxwell, the previous one is Kepler. So Tesla 607 01:17:09.300 --> 01:17:17.279 was 1 generation of it; it's now being used by the PGI compilers to refer to all of them. 608 01:17:17.279 --> 01:17:24.630 It's no relation to the car, and it's no relation to the 609 01:17:24.630 --> 01:17:28.680 marketing level for NVIDIA, where Tesla means
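A sketch of the pieces just discussed, with the convergence loop carrying a max reduction (each gang or thread keeps its own running max and the compiler combines them). The commented compile line shows the general shape of the pgc++ command with the flags mentioned here; exact flag spellings vary by compiler version, so treat both the code and the command as illustrative rather than the course's exact files.

    #include <cmath>

    // Illustrative compile line (flags as discussed: OpenACC, -fast, -Minfo,
    // Tesla target, managed memory):
    //     pgc++ -acc -fast -Minfo=all -ta=tesla:managed jacobi.cpp -o jacobi
    //
    // One sweep of the grid; returns the largest change for the convergence test.
    double sweep(const double *A, double *Anew, int N, int M) {
        double error = 0.0;
        #pragma acc parallel loop reduction(max:error)
        for (int j = 1; j < N - 1; ++j)
            for (int i = 1; i < M - 1; ++i) {
                Anew[j*M + i] = 0.25 * (A[j*M + i+1] + A[j*M + i-1]
                                      + A[(j-1)*M + i] + A[(j+1)*M + i]);
                error = std::fmax(error, std::fabs(Anew[j*M + i] - A[j*M + i]));
            }
        return error;
    }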
610 01:17:30.329 --> 01:17:36.180 and I forget which, the supercomputing level or something. Unrelated. Okay. 611 01:17:37.350 --> 01:17:46.829 So, something like this would say: the target architecture to compile for, compile to run on the GPU, managed memory, which I'll tell you about in a minute, 612 01:17:46.829 --> 01:17:50.760 and print lots of information and optimize it. So. 613 01:17:50.760 --> 01:17:58.949 And you compile it, and this is the sort of information it prints out: it prints out the optimization information about what it can optimize. 614 01:17:58.949 --> 01:18:05.550 And the speedup on multicore here, this would be actually, 615 01:18:05.550 --> 01:18:10.199 this should still be on the Intel, 3 times faster. 616 01:18:10.199 --> 01:18:16.560 And here you see the system generated implicit copy-ins and copy-outs. 617 01:18:16.560 --> 01:18:20.250 You didn't have to specify it, um. 618 01:18:23.489 --> 01:18:29.310 Okay, so the 1st compile was just for the Intel, 619 01:18:29.310 --> 01:18:36.420 about 3 times faster on their particular machine. We then compile it to run on the GPU, 620 01:18:36.420 --> 01:18:46.590 and it got 37 times faster on the Volta. So, that's the 2nd-newest architecture; this is quite a new architecture here. So. 621 01:18:49.199 --> 01:18:53.310 Oh, and here's what their multicore was. Okay. 622 01:18:54.329 --> 01:19:02.850 Portable, GPU, CPU; closing remarks. Okay, good point to stop. Now, let me go back to my 623 01:19:02.850 --> 01:19:06.449 page. Here we go. 624 01:19:07.800 --> 01:19:14.970 Yeah, so what I did today, just to remind you, 625 01:19:14.970 --> 01:19:19.500 is that 626 01:19:19.500 --> 01:19:29.220 I gave you some operating system information and useful tools about ssh. Oh, there's one I didn't mention that I should have, about 627 01:19:29.220 --> 01:19:35.100 stack size: make your stacks bigger, and then it allocates pages when needed. 628 01:19:35.100 --> 01:19:38.579 It doesn't matter if you have a large virtual memory that you don't use. 629 01:19:38.579 --> 01:19:43.050 Well, that's also true when allocating stuff on the heap, but 630 01:19:44.189 --> 01:19:48.329 allocate a big array; until you touch it, it doesn't cost anything. 631 01:19:48.329 --> 01:20:00.689 And I finished off OpenMP. It's a very nice thing for the Intel, and then, to move on, it's fairly low level, so I started OpenACC for you, 632 01:20:00.689 --> 01:20:07.619 which will be better as a high level for compiling to 633 01:20:07.619 --> 01:20:10.619 the GPU, and 634 01:20:10.619 --> 01:20:14.069 I'll continue that next time; we'll run some programs. 635 01:20:14.069 --> 01:20:25.979 I've installed nvc++, I guess, so I would recommend, if you're starting, the nvc++ compiler; I'm thinking it's possibly better than g++, and so on. 636 01:20:25.979 --> 01:20:31.949 And then what we're doing is we're migrating into NVIDIA, with 637 01:20:31.949 --> 01:20:36.270 and so on, slowly. If you want to read ahead of me, 638 01:20:36.270 --> 01:20:40.409 read the next tutorials and have fun. 639 01:20:40.409 --> 01:20:47.399 Okay, so that's enough stuff for today. If anyone has any questions, then. 640 01:20:48.569 --> 01:20:54.329 Hey, professor, can you go over reductions again? I kind of missed that. 641 01:20:54.329 --> 01:21:03.989 Sure, so this is on parallel. 642 01:21:03.989 --> 01:21:07.199 And. 643 01:21:07.199 --> 01:21:11.670 Silence. 644 01:21:11.670 --> 01:21:16.560 So, what we want to do here, let's ignore the pragma.
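On the stack and heap point, a small illustration (my own example, not the course's code, and assuming a typical Linux system with memory overcommit): asking for a big array mostly reserves virtual address space, and physical pages are generally only allocated when you first touch them, so an untouched allocation is close to free.

    #include <cstdio>

    int main() {
        const long n = 1000000000L;        // ~8 GB of doubles, virtual
        double *a = new double[n];         // cheap: pages are not yet resident
        for (long i = 0; i < n; i += 1L << 20)
            a[i] = 1.0;                    // touching a page is what makes it real
        std::printf("touched a few pages of a %ld-element array\n", n);
        delete[] a;
        return 0;
    }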
645 01:21:16.560 --> 01:21:20.310 We want to sum up something, okay, inside a loop. 646 01:21:20.310 --> 01:21:25.800 Like, we want to, say, sum up the variable i, or something more complicated. 647 01:21:27.090 --> 01:21:36.390 And we want to do it in parallel. Now, the problem is that there's that global total variable being computed, and if we access it in parallel, 648 01:21:36.390 --> 01:21:44.579 you know, the different threads will try to write to it. You see, the way this is implemented: you read, you add, and you write back, 649 01:21:44.579 --> 01:21:47.670 and they step on each other, 650 01:21:47.670 --> 01:21:51.270 and you're going to get the wrong answer. So. 651 01:21:51.270 --> 01:21:55.350 Silence. 652 01:21:55.350 --> 01:21:59.520 Let's see, this is, oh, that was right. And do. 653 01:21:59.520 --> 01:22:09.899 Well, if I did not have the reduction, I would get the wrong answer here. So I could say. 654 01:22:18.300 --> 01:22:24.329 Silence. 655 01:22:33.899 --> 01:22:38.880 What? No, that's my laptop. Okay, so. 656 01:22:43.140 --> 01:22:47.970 Silence. 657 01:22:52.050 --> 01:22:57.300 Come on. 658 01:22:58.949 --> 01:23:02.460 It takes a while to run. 659 01:23:03.539 --> 01:23:08.250 Okay, here is the correct answer and the wrong answer. 660 01:23:08.250 --> 01:23:13.199 And I have it mentioned somewhere on here. 661 01:23:15.750 --> 01:23:20.880 You see this problem with the addition: the 2 threads step on each other. 662 01:23:20.880 --> 01:23:24.149 So, the answer is, 663 01:23:25.229 --> 01:23:38.819 if we have this, then it will get compiled so that each separate thread has a local copy of the computed total variable. So each thread will be 664 01:23:38.819 --> 01:23:43.199 summing into a local subtotal variable, so there's no problem. 665 01:23:43.199 --> 01:23:50.310 And then, at the end, all the local subtotal variables will be summed together to make the global computed total. 666 01:23:51.689 --> 01:23:55.680 So this can run in parallel, and we get the right answer. 667 01:23:57.359 --> 01:24:03.960 Does that... Yeah, that's because it's different than, 668 01:24:03.960 --> 01:24:07.710 like, critical or atomic or something, right? 669 01:24:07.710 --> 01:24:13.949 Well, this is more limited in what it can do. You can reduce with only a small fixed set of operators: 670 01:24:13.949 --> 01:24:19.170 sum; the other example, for OpenACC, was a max. 671 01:24:19.170 --> 01:24:22.170 So, there's a fixed, limited set of operators. 672 01:24:22.170 --> 01:24:26.850 But if that's 1 of the things you want to do, 673 01:24:26.850 --> 01:24:32.789 it does it really fast. Gotcha, thank you. The atomic is 674 01:24:32.789 --> 01:24:40.409 more general; the statement following an atomic, again, is limited to what the allowed operations are. 675 01:24:41.430 --> 01:24:49.979 But it's less limited than a reduction, and it's slower than a reduction, but still pretty good. The critical block: 676 01:24:49.979 --> 01:24:56.189 you can put anything you want in the critical block, but there's a big overhead just to start the critical block. 677 01:24:56.189 --> 01:25:00.630 Mm, okay. 678 01:25:00.630 --> 01:25:04.350 Other questions? 679 01:25:04.350 --> 01:25:07.500 Silence. 680 01:25:07.500 --> 01:25:12.510 If not, see you Thursday. Time for lunch. 681 01:25:12.510 --> 01:25:15.689 Sorry, I actually have another 1. Sure, go ahead. 682 01:25:15.689 --> 01:25:19.770 From homework 2, the question about
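A minimal sketch of the demo just described (the variable names and loop body are mine, not the demo file's): without the reduction clause, the threads race on the shared total and you can get the wrong answer; with reduction(+:total), each thread sums into a private copy and the copies are combined at the end.

    #include <cstdio>

    int main() {
        const long long n = 100000000LL;
        long long total = 0;
        // Remove the reduction clause and the threads race on "total":
        // read, add, write back, stepping on each other.
        #pragma omp parallel for reduction(+:total)
        for (long long i = 0; i < n; ++i)
            total += i;                     // each thread adds into its own subtotal
        std::printf("total = %lld (expected %lld)\n", total, n * (n - 1) / 2);
        return 0;
    }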
683 01:25:19.770 --> 01:25:23.789 CUDA cores versus the Intel Xeon. 684 01:25:23.789 --> 01:25:28.409 I wasn't sure about that 1. Can you kind of go over that? Sure. 685 01:25:28.409 --> 01:25:32.189 Silence. 686 01:25:32.189 --> 01:25:37.770 Here, yeah, yeah, it's on 1 of the handouts. 687 01:25:37.770 --> 01:25:42.210 But the Intel Xeon core, it's superscalar. 688 01:25:42.210 --> 01:25:53.310 You can be running 2 hyperthreads, and then you can do maybe a floating add, a floating multiply, an integer add, some sort of conditional test, all in 1 cycle. 689 01:25:53.310 --> 01:26:00.090 So, in 1 cycle on the Intel Xeon, you can do several operations. 690 01:26:00.090 --> 01:26:04.079 Whereas the CUDA core, in 1 cycle, it's 691 01:26:04.079 --> 01:26:13.109 much less; it maybe even cannot do a floating point operation for every CUDA core, for every thread, on the GPU, 692 01:26:13.109 --> 01:26:19.170 because there are fewer floating point 693 01:26:19.170 --> 01:26:23.909 units on the GPU than there are actual CUDA cores. 694 01:26:23.909 --> 01:26:29.789 So that's why I estimate that a CUDA core is 695 01:26:29.789 --> 01:26:34.260 5% of a Xeon core. 696 01:26:35.850 --> 01:26:43.260 Thank you. And that said, if you've got 4000 CUDA cores, that's still faster than, 697 01:26:43.260 --> 01:26:47.250 you know, 20 Xeon cores, but. 698 01:26:47.250 --> 01:26:51.899 Oh, by the way, this is an interesting design issue; I'll mention it more later, but 699 01:26:51.899 --> 01:26:58.020 when NVIDIA is designing a GPU, they have to decide 700 01:26:58.020 --> 01:27:03.300 how many floating point processors to put on the GPU, 701 01:27:03.300 --> 01:27:12.744 and how many double precision; that's a separate chunk of hardware, single versus double precision. And as they go from generation to generation, NVIDIA keeps changing things around. 702 01:27:13.645 --> 01:27:25.314 So, from Kepler to Maxwell, they especially reduced the number of double precision cores, so double precision went a lot slower, and then in the following generation they reversed their decision somewhat. 703 01:27:25.619 --> 01:27:37.979 And then in 1 of these generations they brought in half precision floating point, 16-bit floats. So if you had half precision floats, it went very fast, but if you had double precision floats, it went a lot slower. So. 704 01:27:39.000 --> 01:27:43.319 It's a design decision that the hardware designers make, as to how much, 705 01:27:43.319 --> 01:27:49.380 you know, how much area on the silicon to allocate to the different functions. 706 01:27:49.380 --> 01:27:57.659 And the chip still computes the right answers, but it's how much time it takes for the different functions. 707 01:27:59.489 --> 01:28:03.000 Other questions? 708 01:28:04.380 --> 01:28:07.680 Anyone else? Okay. 709 01:28:08.909 --> 01:28:12.270 And I'm not going to bother saving this chat window; there's nothing in it. So. 710 01:28:20.279 --> 01:28:21.390 Hello.
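Taking that rough estimate at face value, and purely as back-of-the-envelope arithmetic rather than a measurement: 4000 CUDA cores times 0.05 is about 200 Xeon-core equivalents, versus the 20 Xeon cores mentioned, so on the order of 10 times the throughput, provided the code can keep all of those CUDA cores busy.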