
Raccoons In Top Hats: OpenAI and their invention CLIP

Hey, this is John and Ben. Welcome to Automatic's YouTube channel. Today we're going to be talking about the history of OpenAI and a little bit about one of their inventions, called CLIP.

I'm gonna start your five minutes. You all ready?

I'm ready, do it. Start.

All right, so OpenAI. Admittedly I'm getting some of this stuff off of Wikipedia, but some other places too. I want to get to OpenAI's mission because it's crazy, and it's something that I didn't realize until I actually looked into this, but let's go back and talk about the history and the facts first. I don't know how many people know this, though probably a lot of people that follow my other YouTube channel do, but Elon Musk is one of the co-founders of OpenAI, and it's only been in existence since December 11th of 2015. So not even seven years yet, and holy crap, they've done a lot of stuff in seven years.

OpenAI was founded by Elon Musk and Sam Altman. Sam is somebody you might not know as much about, but he's actually the founder of Y Combinator.

He has started up, I mean, holy mackerel, I think it's like Coinbase, Airbnb, you name it. They were the accelerator that did all this stuff, and they started in 2005, so it really wasn't all that long ago that all this happened. So these are two really big hot shots, and they, plus some other investors, collectively gave OpenAI over 1 billion dollars to start up. Boy, I wish we had that kind of funding, right?

Nice, that would be nice.

And then even better than that, they were started as a non-profit. This is kind of confusing because there's OpenAI LP and then there's OpenAI Inc. OpenAI Inc owns OpenAI LP. OpenAI Inc is non-profit, but OpenAI LP is what's called capped profit, so that they can actually take investments from other companies. I think what forced that to happen was that in 2019 Microsoft wanted to invest a billion dollars into OpenAI, so they changed things around so that they're sort of for profit. It's a weird status that they're in, but I'm gonna get to their mission statement and the reason why they're that.

At this point they're based in San Francisco, and interestingly enough they're in the same building, the Pioneer Building in San Francisco, that Neuralink is in, and that's another Elon Musk company. When people talk about Musk's companies it's obviously Tesla and SpaceX, but then also Neuralink and OpenAI and The Boring Company, which is the tunneling thing. So there's a lot of companies that he owns, and you can kind of see this collective vision. Basically what they wanted to do was to get the best AI people in the world. Andrej Karpathy, by the way, was at OpenAI before he went to Tesla, so he was one of the super hot shots at OpenAI that then went on to Tesla.

Apparently top-level AI researchers are paid higher, they said, than NFL players, so they have really, really high salaries. Apparently OpenAI doesn't quite give them that Google or Facebook level of money, but they give them a lot more freedom. Anyway, so that's some historical stuff. Forgive the train, sorry, there's a train going by. But I really want to get to this. Can you see that?

Yep, I can see.

Okay, so this is the crazy thing. This is their mission statement. OpenAI's mission statement is to ensure that artificial general intelligence benefits all of humanity, and artificial general intelligence is the kind of crap that you think about when you think about Terminator or, oh gosh, what are some other future history movies?

Ex Machina.

Yes, Ex Machina, oh, that's a really good one, that's a terrifying movie. But anyway, this is kind of crazy that their mission is that big, and that's

what they're trying to do is to actually ensure that Artificial General Intelligence benefits all of humanity

And I want to get here to their charter, that's what I really wanted to talk about, and I have just about one minute left. Oh crap, okay. So basically they want to do this, and they said also that if somebody else is beating them, they will then turn around and help that other group of people. The idea is to direct AI to become as friendly and nice to humanity as possible. They commit to deployment of all of their stuff. Their primary fiduciary duty is to humanity, which I think is really cool. They're committed to doing the research required to make AGI safe and to drive the broad adoption. And what they say is that in order to have effective leadership they can't just try to set policy, they actually have to be in the forefront of creating AGI, and that they'll cooperate with other people. In the last 20 seconds I just wanted to say, if you know Elon Musk and all of his companies and their mission statements, you can see how OpenAI is part of this vision of creating a future that's good and makes people optimistic, and so I think that's really, really cool. I had no idea until I started looking into this that that was their mission statement, but I think that's amazing. And that's five minutes, right?

Yep, that's five minutes.

There we go. All right, so what do you have to say about that? That was a lot of information dumped at you real quick.

Yeah. Actually, the company structuring makes a lot of sense, because there are some things that they're just releasing to the public for free, open sourcing, like CLIP, which I'm about to talk about. But then they have other things like DALL-E, which is an image generator that they kind of keep behind a paywall, hosted on their sites. So I'm sure there's a lot of corporate structuring stuff around those decisions.

Well, I didn't get into it because I ran out of time, but of course they developed GPT, GPT-2, GPT-3, and they're now working on GPT-4. GPT is generative pre-trained trans...

Generative pre-trained transformer.

Yeah. But anyway, that's the gigantic model that sort of broke the internet and changed things around for the better. When they produced GPT-2, even though their mission is to share all of their stuff, they said that they wouldn't actually share the model, because they were trying to be careful about releasing it to everybody. They were worried that some bad players would be involved, and I think that's part of the argument for DALL-E and DALL-E 2 as well.

Yeah, makes sense. I think there might also be a financial incentive to keeping it close. But there's a lot you can generate, and there's a lot of things you wouldn't want mass generated, but that's a whole other ethics debate.

Yeah, that's a huge thing. And it is interesting that there was this division, because they were definitely not for profit until 2019, when Microsoft wanted to invest a billion dollars, but now they are kind of for profit. So I think now it's like, wait a second, if we release all of this stuff then how do we make any more money? So I think they do have some financial motivation. I actually have talked about that before in another video, about how DALL-E 2 is kind of like a flywheel. You have humans interacting with it, giving input and saying this is what I want, and then clicking the ones that they like. That gives an indication to the AI of which ones human beings think are the best product for that input, and then they can use that to generate more data. So they can actually create a commercial flywheel, and people are paying 15 bucks a month for it, so there you go.

Yeah, then they keep a hold on that data that they're creating through this flywheel, right? So, pretty classic.

It's interesting because a lot of people don't think about Elon Musk as involved in OpenAI as much, but he did quit the board before Microsoft invested, because he said it would be a conflict of interest with Tesla, I think. So he's just a donor at this point, and I'm sure has some stock or whatever the equivalent is in the company, but kind of hands off.

It's certainly not a publicly traded company.

No, but I'm sure there's also some sort of private... I mean, if you invest a billion dollars, yeah. It's not an LLC exactly. So yeah, interesting.

That is very interesting. Was there anything else you wanted to say about the AGI stuff?

I just found it interesting because I thought their mission statement would be something like, hey, we want to create the coolest toys and the best artificial intelligence. But to talk about specifically general intelligence... Up until now it's been narrow AI, like what we're going to talk about with CLIP next. It's really amazing what it does, and it's better than human beings, but it's not a replacement for a human being. AGI kind of trends into that realm of, this thing understands itself and it knows what it is and it's like a conscious being. So I didn't realize that their mission was quite that, like, it's a pretty darn big mission, you know? But again, it makes sense for Elon Musk because he's that kind of guy. He's like life 2.0 on Mars and making the whole world electric with Tesla, so it makes sense that you would have that kind of ridiculously large vision.

Yeah, I think it also makes sense why they tend to do these big projects that are part of the human experience, right? Image understanding and text synthesis, those kinds of things that are part of AGI, that you have to build together and figure out how they're going to work together, rather than doing something a little more specific like what we're doing at Automatic.

Yeah, very commercialized AI, very specific things for a specific group of people, right. And certainly GPT just started off as a research paper. It was not that long ago, it was 2018 with GPT-1, which was the original research paper for it. So I think they're pure research, right? They just try stuff, and maybe it'll work and maybe it won't, but because they're such smart people, oftentimes I think it works. But then they've got this interesting little commercializing angle that they've added to the whole thing.

It'd be interesting to know how many different things they're trying and kind of failing at compared to the things that are really succeeding, because with just the pure genius of the people that they're hiring, I'm sure they have tons of good ideas every day, but you only hear about a few of them, right?

Yeah, that could be true, and who knows what's cooking right now that could come out in a year or two.

Yeah, or ten years.

Yeah, exactly, that's actually true. They may have people who are really smart and have ideas, but they're just not possible yet because they're so advanced. Interestingly enough, I think as of 2021 they only had 120 employees, so it's not a big company at all.

Oh no, not at all.

So I would imagine, given Elon Musk's sort of bent, it's probably like 80% of those people are actually active AI researchers and 20% are support.

Yeah, that makes sense.

So anyway, that could be like 100 of the best AI researchers in the world. I guess a football team's a good comparison, because there's usually like 60 or 80 players who are part of the team.

Yeah, you've got your managers and people that are taking leads on different projects, right? I guess the football analogy falls short there.

Right, yeah, that's true, there's no coach. Oh, by the way, Sam Altman, the Y Combinator guy, is the CEO of OpenAI. So he has not taken a hands-off approach, he is fully invested. He has some good YouTube videos if anyone's interested.

Sam Altman interviews Elon Musk on how to build a better future

I think he interviewed Elon Musk one time. He's just a really intelligent guy.

Yeah, okay, I'll have to check those out. That's cool.

So speaking of all this stuff, CLIP. I don't know if it's an acronym for something, I can't remember, but anyway it's an image-to-text sort of processor. If you think about typing in text and getting an image, this is the inverse of that, right? This actually takes an image and gets text out of it. Is that correct?

All right, well...

All right, let's give you five minutes to tell me. Let's start the clock. You ready? Okay, here we go. Three, two, one, go.

All right, so CLIP stands for Contrastive Language-Image Pre-training. They always have to have pre-training in there. It's a good-sounding acronym, but the jumble of words is too much. What it really does on the surface is connect text to images: it tells you the similarity between some images and text. It doesn't necessarily generate the text for you, but from an image and a set of input texts it can tell you the predicted similarity between all of these. So you can see with this picture of a dog, it knows it's a King Charles spaniel compared to a cocker spaniel.

So in other words it's better than me.

Yeah, exactly. That's pretty incredible. Black-chinned hummingbird: it wasn't perfect at this one, but they were pretty honest in their research, showing just randomly chosen results, which is pretty noble.

So what it does is it takes an image and encodes it, which is basically smashing it down into just some numbers. I might talk about encoding in another video because that's a whole thing on its own. It does the same with a group of texts: it tokenizes them into numeric values and then puts them into an encoding that the machine can actually learn. Then it creates this similarity matrix to tell you which combination is the most similar. It's trained using a set of pre-made image-text pairs, and I think it was 400 million text-image pairs for the data set. So it'll randomly pick out maybe 256 images and their text pairs, which were pre-made, pre-assigned, and then it attempts to

it creates this matrix where it attempts to improve the similarity between the input image and its text and reduce the similarity of that image to every other text. Are you following me so far?

I think so. So can it cheat? It actually knows ahead of time what the right answer is in these while it's training?

That's the image-to-text dataset, yeah, for training.

Yeah, so the loss value would know.

Yeah, right, the optimizer would know. So once you have this trained CLIP model, you can take a huge set of texts and an image and ask which one of these texts is closest to this image. You can imagine this could be a thousand different words over there, playing card, dog, train, house, and it will come up with the text that's most similar to that input image.

All right, so you might be wondering how this is different from your typical image classification algorithm. CLIP is a whole lot more flexible. You don't need to know your categories ahead of time. It can tokenize really any text phrase, to a certain extent. I think it's up to about a paragraph. You can imagine that it can turn that into an encoding, which is by the way... oh man, I'm not gonna make it.

We can cheat and give you a little extra time.

So it's very flexible, is the big thing. I'm gonna go through why I think it's super cool. First, it's open source, so anybody can use it, and that was awesome of OpenAI to do. There are quite a few other algorithms relying on it right now that we're seeing come out in real time. Okay, that's my time.

But there's your time. So hey, who loves cool items? Let me ask you: can you tell me more about this?

Oh yeah, there we go. The second cool thing is that it's just super good at captioning, like we haven't seen before. It's probably the best tool for going from an image to a descriptive caption that I'm aware of. I'm not sure what it's called, but I think blind people have those devices that help them decipher what things in the physical world are, right? So you could put something like that on a computer, so they could hover over an image that doesn't have alt text attached to it and just say, here's an image, convert it to text immediately. Obviously it's not perfect right now.

The third really cool thing about this is it works super well on captions that it has not seen before. You could give it something completely new that no one has ever thought of before, like a raccoon wearing a top hat smoking a cigar.

And you know that's going to end up getting overlaid over this right now, because I'm going to have to go to search. Yeah,

There's the thumbnail for the video.

Exactly. You can give it something that it's never seen before, but it has an ability to reason by breaking this text apart into its different encoding pieces and then looking at similarities to maybe even new images that it's never seen before.

Wow. I'm going to be referring back to CLIP a lot in future videos.

Yeah, we're going to sort of build these things up, I think. So that's actually cool. Do you want to stop sharing screen, and we could continue talking about it, because I have some questions about this.

Cool, see if I can.

Well, the first one is this. A lot of people might think, okay, this is just like a convolutional neural network that does classification. But the way that's structured, the ImageNet thing has a thousand categories, and each category I think is just a word. I mean, there's a number associated, but it's just a simple thing like dog, cat, whatever. This is able to tokenize and understand, like you said, several words, a sentence, a paragraph, a lot of text, so it's much more capable than a traditional convolutional neural network. I think the part that I'm still confused about is when you give it these image-text pairs. There's an image of the King Charles spaniel or whatever that thing was, but did somebody actually write the text of that down?

Yeah, that's pulled from ImageNet's massive data set of already captioned images. Yeah, somebody had a caption.
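As a rough sketch of the "which of these texts is closest to this image" step described in the conversation: compare one image embedding against several caption embeddings and rank the captions by similarity. Everything specific here is an assumption for illustration. The three-number embeddings are made up (a real CLIP model would produce them with its trained image and text encoders, in far more dimensions), and `rank_captions` and the `temperature` scale of 100 are hypothetical choices, not OpenAI's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_captions(image_embedding, text_embeddings, temperature=100.0):
    """Return (caption, probability) pairs, most similar caption first.

    temperature is an assumed scaling factor that sharpens the softmax.
    """
    captions = list(text_embeddings)
    sims = [cosine_similarity(image_embedding, text_embeddings[c]) for c in captions]
    probs = softmax([temperature * s for s in sims])
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)

# Made-up embeddings, standing in for real encoder outputs.
toy_image = [0.9, 0.1, 0.2]
toy_texts = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a train": [0.1, 0.9, 0.3],
    "a photo of a playing card": [0.2, 0.1, 0.9],
}

ranking = rank_captions(toy_image, toy_texts)
best_caption = ranking[0][0]
print(best_caption)
```

The point of the design is that the candidate captions are just a dictionary of free text. Swapping in a new caption needs a new text embedding, not a retrained classifier with a fixed list of a thousand categories.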

There's probably a pretty good story there.

Yeah, we might have to do some research on that and how they actually accomplished it, because it's an incredible data set of already captioned images. I'm not quite sure of all that they used, but it's all pre-captioned images, and a variety of them too. And I think people are continuing to fine-tune CLIP with their own data sets. I want to say the stable transformers, Stable Diffusion, sorry, one of the newest models that have come out with CLIP, I think they used Pinterest for a lot of their training.

Interesting, okay. Yeah, because people generally tend to put something underneath.

Yeah, a Pinterest board is just a bunch of images with little titles that are descriptive.

Wow. I think part of what blows my mind is not just how cool all this technology is, but the fact that we have access to 400 million image-text pairs. That's ridiculous.

Yeah, it's crazy. And they found a way to make use of that, because there's a whole lot of data out there on the internet that is mostly unusable for training AI. Up until some of these newer models, we couldn't use these image-text pairs because they weren't perfect enough. Like you said, a typical convolutional network might have just had a thousand or ten thousand outputs, and those outputs were like, output one: what's the likelihood that this is a car? It wasn't a photo of a car, it wasn't a photo in vintage black and white of a...

Yeah, yeah.

It was very specifically car. But with this embedding, it really expanded how we could use this data.

Yeah, I think that's the biggest thing. Even if people don't understand exactly how convolutional neural networks work, they were super cool. It's still so awesome that you could take a black and white photo or a color photo, or the car in a parking garage or the car at an auto show, and it still knows it's a car. That is really cool. But this is so much more specific, because it's like, I want a red car in a parking garage at night, and it will find that thing, or it will generate the text that says that if you show it a picture of it, in this case. So I imagine at this point that they're probably able to do a lot of bootstrapping to create more labels, because now that you've got something that works pretty well, you can give it pictures that don't have labels and hopefully it labels them decently well.

Yeah, you can make more image-to-text labels of your own, which might be what they're doing with DALL-E in some form. When people generate images on their site and choose which are the best images, the ones I like the most that fit this prompt, OpenAI could mark that and set it aside as an image-text pair to train on in the future.

Right, yeah, that was kind of what my imagination was. I was like, look at this flywheel they're creating. All us stupid humans are like, oh, let's do this, and we're just feeding it more and more data. So I think you're totally right. I just think the CLIP stuff grounds it a little bit more in this very specific example of doing that. It's really cool that they can do an image embedding and a word embedding and just smash them together. So for the stuff that they give it for training, they have the correct label, the text that went with it, and then they just give it a random assortment of, would you say 256 others? I guess they just pull random labels from other stuff?

Well, I think the top row was all of the image embeddings, and the left column was all of the text embeddings. Your first image and your first text will match up, and you want that to have high similarity. Image two and text two, you'll want those to be the most similar, then image three and text three, and that's where that...

Oh, so the diagonal, okay. So you're training on a batch, so you might get 256 of these image-text pairs at a time, and image one and text two should be dissimilar, so it also needs to learn to reject that possibility.

Yeah, so it ends up with this pretty cool loss function of increasing similarity of certain items but decreasing similarity of other items, which makes the training a whole lot better.

Yeah, just a little bit. Have you dug into the math of it, like how they're actually structuring the loss function?

No, I couldn't tell you.
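For readers curious about the loss function question, here is a minimal sketch of the idea as discussed: in a batch's similarity matrix, the diagonal entries (image 1 with text 1, image 2 with text 2, and so on) should score high and everything off the diagonal should score low. To my understanding, the published CLIP description does this with a cross-entropy over the rows plus a cross-entropy over the columns, averaged together, and that is what is sketched below. The function names, the tiny 3x3 matrices, and the `temperature` value of 0.07 are illustrative assumptions, not OpenAI's implementation.

```python
import math

def log_softmax(row):
    """Log-probabilities for one row of scores (numerically stable)."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_total for v in row]

def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct answer for row i is column i."""
    n = len(logits)
    return -sum(log_softmax(logits[i])[i] for i in range(n)) / n

def contrastive_loss(similarity, temperature=0.07):
    """similarity[i][j] is the cosine similarity of image i and text j.

    temperature is an assumed scaling value, used here only for illustration.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_images = cross_entropy_on_diagonal(logits)      # each image picks its text
    loss_texts = cross_entropy_on_diagonal(transposed)   # each text picks its image
    return (loss_images + loss_texts) / 2

# Made-up similarity matrices for a batch of three image-text pairs.
well_matched = [[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.0, 0.1, 0.9]]
mismatched = [[0.1, 0.9, 0.0],   # image 1 looks most like text 2
              [0.8, 0.2, 0.1],   # image 2 looks most like text 1
              [0.0, 0.1, 0.9]]

loss_good = contrastive_loss(well_matched)
loss_bad = contrastive_loss(mismatched)
print(loss_good < loss_bad)
```

A batch of n pairs gives n matching entries on the diagonal and n × (n − 1) mismatched entries everywhere else, so a batch of 256 pairs has 256 positives and 65,280 negatives to learn from in one step.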

So the model itself is available, or the actual code to generate the model, so you could retrain it yourself?

Oh, I think it's all available.

Oh, it's all... okay, wow, so it really is open source. That's super cool. Although you'd need a pretty intense computer.

Yeah, I was looking at that. GPT-3, if I remember, is 175 billion parameters, and DALL-E 1, I don't know what DALL-E 2 is, but DALL-E 1 was a 12 billion parameter mini version of GPT-3. So I assume DALL-E 2 is some bigger subset of GPT-3.

Interesting.

But yeah, it takes more than a home computer to run this stuff on. I got it running on my local machine, but with batch sizes of about four, so four images compared to four texts. It works, but there's only so much the machine can learn. I said 256 for example, but it could have been a thousand images and a thousand text pairs, and there's just a whole lot more that the computer can learn per batch with that much more choice involved.

Interesting. Well, we need to fire up all that Amazon AWS stuff and try to do macro batches instead. There's actually some stuff on the Wikipedia page about how, I think, they had to rent, oh gosh, I don't know if I'll find it, but they had to rent one of Google's entire clusters for like a month to do GPT-3 training. It was something outrageous like that, I think.

Okay, yeah, see where their billion dollar investments go.

Yeah, I think it was over five million dollars to train it, and that was the successful run. I think they also did a bunch of trial runs before that, so before you get to the one that actually worked, they probably had like a hundred that didn't. So they probably spent way more than five million dollars trying to get it to work.

Yeah, that and the man hours involved.

Oh gosh, yeah. Anyway, pretty cool stuff. All right, I guess that sounds good. Wow, we've talked a lot longer than five minutes on this stuff, but this is super interesting. Hopefully folks will have questions in the comments for us to answer. Or if there's a piece of this you totally don't understand, and you're like, what the hell was that, by all means ask, because we can totally do an episode on that specifically. That's the beauty of YouTube: we can be responsive to this and recast it in what you're interested in.

And check out our blog. We can go into a good bit more detail in writing, and we can be more precise there.

Right, so almost bigger content. So by all means ask away, and thank you for watching, and let us know what you think.

All right, like and subscribe.

There you go, oh, thank you. Like always. You would not believe how important it is, because speaking of AI, YouTube's AI is all about the liking and the subscribing. It's one of the things when you research this area, it's kind of weird, it's so meta, because you're like, oh, I know exactly what YouTube's algorithm is doing under the hood. It's like, ah, these people are paying attention to this stuff. So there you go.

All right, cool. Yeah, thank you so much for watching, and everybody have a good day. We'll see you later. Bye.
