A collection of unsettling images (these unsettling photos indicate that AI has become smarter! It i
Of all the AI models in the world, OpenAI's GPT-3 has most captured the public imagination. It can produce poetry, short stories, and songs from only a small amount of prompt text, and can convince people that they were written by a human, yet it still comes across as very "naive" in conversation with humans.
However, researchers still believe that the techniques used to create GPT-3 may be a necessary step toward more advanced AI. GPT-3 was trained on a large amount of text data. So what would happen if both text and image data were used for training? The Allen Institute for Artificial Intelligence (AI2) has made progress on this question by developing a new visual-language model that can generate images from a given text description.
Unlike the hyper-realistic works generated by GANs, the images generated by AI2's model may look very strange, but the approach may indeed be a new path toward general artificial intelligence.
AI "Problem Maker"
GPT-3 belongs to the class of models known as "Transformers", which became popular with the success of Google's BERT.
Before BERT, language models were of limited practical use. Although they had some predictive ability, they could not generate long sentences that were both grammatical and consistent with common sense. BERT significantly improved on this by introducing a new technique called "masking".
The model is asked to complete fill-in-the-blank questions similar to the following:
This lady will go to the ___ to exercise.
They bought a ___ of bread to make sandwiches.
The idea was that if the model were forced to perform millions of such exercises, it might learn how to combine words into sentences and sentences into paragraphs. Test results show that the model did become better at generating and interpreting text (Google is using BERT to help provide more relevant results in its search engine).
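
The fill-in-the-blank exercise described above can be sketched in a few lines of Python. This is only an illustrative sketch, not BERT's actual implementation: the function names `mask_tokens` and `restore`, the `___` placeholder, and splitting on whitespace are all assumptions made for the example, and real systems mask subword tokens rather than whole words. The default rate of 0.15 reflects the masking rate commonly cited for BERT.

```python
import random

MASK = "___"  # blank shown to the model (hypothetical placeholder)


def mask_tokens(sentence, mask_prob=0.15, seed=None):
    """Hide a random subset of words in a sentence.

    Returns (masked_tokens, answers), where answers maps each hidden
    position to the original word the model would have to predict.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    if not tokens:
        return [], {}
    masked, answers = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            answers[i] = tok
        else:
            masked.append(tok)
    # Guarantee at least one blank so every sentence yields an exercise.
    if not answers:
        i = rng.randrange(len(tokens))
        answers[i] = tokens[i]
        masked[i] = MASK
    return masked, answers


def restore(masked, answers):
    """Fill the blanks back in, as a perfect model would."""
    out = list(masked)
    for i, tok in answers.items():
        out[i] = tok
    return out


if __name__ == "__main__":
    masked, answers = mask_tokens(
        "They bought a loaf of bread to make sandwiches", seed=0
    )
    print(" ".join(masked))
    print(answers)
```

During training, the model would see only the masked tokens and be scored on how well it recovers the entries in `answers`; repeating this over millions of sentences is the "exercise" the article refers to.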
After masking had proved effective, researchers tried applying it to visual-language models by hiding words in the text that accompanies an image, such as:
A ___ standing by a tree (Source: MIT TR)
Through millions of rounds of training, the model can discover not only patterns among words, but also relationships between words and elements in images. As a result, the model learns to associate text descriptions with visual content, just as human infants connect the words they learn with the things they see.
For example, when the model is given the image below, it can produce a fitting caption such as "Women playing hockey". It can also answer questions such as "What color is the ball?", because the model can associate the word "ball" with the circular object in the image.
Figure | Women's Hockey Competition (Source: MIT TR)
A picture is worth a thousand words
Researchers wanted to know whether these models have truly "learned" to understand the world the way babies do. Children can not only associate words with the images they see, but also picture corresponding images in their minds when they encounter words, even if those images mix the real and the imagined.
Researchers tried to make the model do the same thing: generate images from text. But the model only spat out meaningless pixel patterns.
Is it a bird? Is it a plane? No, this is just a "masterpiece" produced by AI (Source: MIT TR)
There is a reason for this result: converting text into images is much harder than the other tasks. Ani Kembhavi, the head of AI2's computer vision team, said that a text description does not specify everything an image contains.
The model therefore needs to draw on a great deal of real-world common sense to fill in the details. For example, if the AI is asked to draw "a giraffe walking on a road," it needs to infer that the road is more likely to be gray than pink, and more likely to run alongside grassland than an ocean, even though none of this is stated in the text.
Kembhavi and his colleagues Jaemin Cho, Jiasen Lu, and Hannaneh Hajishirzi therefore decided to see whether they could teach the AI all of this implicit visual knowledge by changing what gets masked. Instead of training the model to predict hidden words from the corresponding image, they trained it to fill in the missing parts of an image based on the text.
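
The change in masking direction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not AI2's actual method: the image is assumed to be a small grid of integer "patch codes", and the names `mask_patches`, `make_example`, and `mask_ratio` are hypothetical. The caption is left intact, and parts of the image are hidden so that they become the prediction targets.

```python
import random


def mask_patches(grid, mask_ratio=0.5, seed=None):
    """Hide a random subset of cells in a 2D grid of image patches.

    Returns (visible, targets): `visible` is a copy of the grid with
    hidden cells replaced by None, and `targets` maps each hidden
    (row, col) position to the original patch value.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(grid) for c in range(len(row))]
    # Hide at least one cell, but never more cells than exist.
    n_hidden = min(len(cells), max(1, round(len(cells) * mask_ratio)))
    hidden = set(rng.sample(cells, n_hidden))
    visible = [
        [None if (r, c) in hidden else value for c, value in enumerate(row)]
        for r, row in enumerate(grid)
    ]
    targets = {(r, c): grid[r][c] for (r, c) in hidden}
    return visible, targets


def make_example(caption, grid, mask_ratio=0.5, seed=None):
    """Build one training example: full caption in, hidden patches out."""
    visible, targets = mask_patches(grid, mask_ratio, seed)
    return {"caption": caption, "visible_patches": visible, "targets": targets}


if __name__ == "__main__":
    # Integers stand in for patch contents (a hypothetical encoding).
    image = [[1, 2, 3], [4, 5, 6]]
    example = make_example("a giraffe walking on a road", image, seed=1)
    print(example["visible_patches"])
    print(example["targets"])
```

In this framing the model would be scored on recovering the entries in `targets` from the caption and the remaining visible patches, which is the reverse of the caption-word exercise.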
Although the final images generated by the model are not entirely realistic, that is not the point. What matters is that they show the model has picked up the right high-level visual concepts; in other words, the AI has to some extent developed a child-like ability to draw from text.
Examples of images generated from text by the AI2 model (Source: MIT TR)
The model's ability to generate images from text represents an important step in AI research, indicating that the model has some capacity for abstraction, which is a fundamental skill for understanding the world. In the future, this technology is likely to have a significant impact on robotics.
The better robots are at understanding visual information and at using language to communicate, the more complex the tasks they can perform. Hajishirzi said that in the short term, this kind of visualization can also help researchers better understand what AI models are learning. Next, the AI2 team plans to run more experiments to improve the quality of the generated images and to broaden the model's visual and linguistic range.