OpenAI’s DALL-E 2 image diffusion model: Extending Diego Velázquez’s Las Meninas (1656)

Digital technology tilts towards an economy of abundance not scarcity. And, with significant new developments in AI technology (notably large language and image diffusion models), we are potentially witnessing a shift not only to a new economy of knowledge, but one in which knowledge is itself altered.

The nature and competitiveness of knowledge-based economies and the shift to creative economies have long been debated. Underpinning these debates are two main considerations. Firstly, innovation can be understood as the development of value from creativity and invention; beyond merely incidental, localised improvements, it can amount to what Margaret Boden describes in The Creative Mind as ‘historical creativity’, i.e. forms of innovation that are paradigm-changing. Secondly, and perhaps more fundamentally, there is the importance of information, or ‘knowledge about knowledge’, as famously characterised by Peter Drucker in his book Post-Capitalist Society.

An information-based economy overturns the problem of scarcity, which Drucker argued would lead to increasingly niche organisations seeking to specialise for competitive advantage, because capital (in this case knowledge) will be freely available. The digital turn, our ability to ‘cut and paste’ information effortlessly, is symptomatic of what Drucker had predicted. The latest advances in AI technologies arguably represent a further change, a potentially profound form of historical creativity; profound because it binds within it the means of creative knowledge production itself. This article focuses on one specific new technology, AI image diffusion generators, which provides a means to explore not just the impact on the knowledge economy, but upon knowledge itself.

AI Image Diffusion

Just this year a plethora of articles have been circulating regarding AI image generators, based upon new ‘decoder diffusion models’. The most widely reported has been OpenAI’s DALL-E 2, along with Google’s IMAGEN (other smaller scale and/or open access models include: Midjourney, Nightcafe, Starryai, and Dalle-mini). In an article for The New York Times, Kevin Roose notes that what is impressive about a model such as DALL-E 2 is not simply that it can generate new art, but how it performs this task: ‘These aren’t composites made out of existing internet images — they’re wholly new creations’.

At the time of writing, the beta-testing of the technology is still relatively modest, but nonetheless represents a paradigm shift. The diffusion model works by progressively corrupting (diffusing) training data, until the data becomes pure noise. A neural network is then trained to reverse the process. As Jonathan Ho (at Google) explains: ‘Running this reversed corruption process synthesises data from pure noise by gradually denoising it until a clean sample is produced’. While the model involves a form of reverse engineering, as found with the adversarial layer of GANs, the technique is categorically different and is fast overtaking the use of GANs. Indeed, as Carlos Pérez notes: ‘GANs have been surpassed by Diffusion models that are orders of magnitude more efficient’; not least as the underpinning decoder technology ‘is an Ordinary Differential Equation … that can be solved by many numerical methods developed in the past decades. It requires no training!’
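The corruption half of this process can be illustrated with a minimal sketch. It assumes a common Gaussian formulation in which each step slightly shrinks the data and mixes in fresh noise; the function names, the toy four-value 'image' and the linear noise schedule are illustrative assumptions, not details of DALL-E 2 or IMAGEN. The reverse (denoising) half needs a trained neural network and is not shown.

```python
import math
import random


def forward_diffuse(x0, betas, rng):
    """Progressively corrupt a data vector with Gaussian noise.

    Each step applies x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps,
    where eps is fresh standard-normal noise. Returns every intermediate state.
    """
    x = list(x0)
    trajectory = [x]
    for beta in betas:
        x = [
            math.sqrt(1.0 - beta) * value + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for value in x
        ]
        trajectory.append(x)
    return trajectory


def signal_fraction(betas):
    """Share of the original signal amplitude left after all steps."""
    keep = 1.0
    for beta in betas:
        keep *= 1.0 - beta
    return math.sqrt(keep)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Assumed linear noise schedule over 1000 steps (illustrative values).
betas = [0.0001 + (0.02 - 0.0001) * t / 999 for t in range(1000)]
x0 = [1.0, -1.0, 0.5, 0.0]  # stand-in for pixel values

trajectory = forward_diffuse(x0, betas, rng)
print("signal remaining:", signal_fraction(betas))
print("final state:", trajectory[-1])
```

After enough steps almost none of the original signal survives, which is the 'pure noise' starting point from which a trained model would then work backwards.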

Importantly, two key factors will allow for more widespread, ‘off-the-shelf’ use. Firstly, these models work in two parts, moving from a sentence to an image. They are layered upon increasingly impressive large language models, providing a powerful and accessible interface, which allows the user to request image generation based simply upon natural language sentences. Secondly, the diffusion method, which as noted first reduces data to noise before then building up an image (effectively from scratch), provides the means for the original generation of imagery, thereby introducing a new creative property to computer vision. The real ‘art’ of this model is its ability to form prediction models of pixels in a similar manner to predicting words in language models. Despite the subtlety of imagery, which does not adhere to ‘grammar’ in the sense we would say of language, this technique is effectively working in the same way as text prediction; i.e. by training models on the smallest units of meaning (whether words or pixels). Furthermore, the integration of large language models and image diffusion knits together both word and image, making these models very simple to use. In so doing, these models solve the apparent ‘problem’ of the complexity of images, over which human culture has obsessed for millennia, and which computers can now format quickly as information.

Automatic Art

Figure 1: Images rendered by DALL-E 2, based upon the text entry ‘portrait of chess player, pencil drawing’

Figures 1 and 2 give a demonstration of image generation with DALL-E 2. In each case the rendering took less than 15 seconds. Figure 1 shows the result of the simple text prompt ‘portrait of chess player, pencil drawing’.

Figure 2: Images rendered by DALL-E 2, based upon the text entry ‘portrait of Go player, pencil drawing’

Figure 2 shows the result of the prompt ‘portrait of Go player, pencil drawing’. In both cases, the reference to ‘pencil drawing’ prompts for a specific medium. A wide range of terms can be applied in this way to help narrow down the type of image required. It is notable that in the case of chess, the reference to ‘player’ seems to betray a gender bias, with all results showing men. However, the quality of composition in each case is striking. The chess board itself is only implied, which focuses attention upon the ‘portrait’ of the players, who each look poised and thoughtful. We can read intensity and drama in these images. In the case of Go, one woman is represented, and the players all appear to be of Asian descent, which no doubt reflects the popularity of Go in Asian countries (a whole study can be made of these biases, which indeed is vital work going forward). Similarly, with the Go players, we see studied portraits, with a real sense of a game underway. It is worth remembering that these images are completely new. They are not direct copies or even adaptations of existing images.

Figure 3: Chess player image, shown here (far right) to have been edited, keeping to the same style.

Figure 4: Go player image, shown here (bottom) to have been extended both to the left and right of the original, keeping to the same style.

Figures 3 and 4 show secondary processes, where, for example, elements of a generated image can be erased to then prompt further adjustment. Figure 3 shows how the chess pieces in the original image have been seamlessly edited out; such rendering is very swift and involves no image editing skills on the part of the operator.

Figure 4 shows a more recently added technique to prompt extensions to the original image. In this case, the Go board is extended to the right, and on the left, the woman’s elbow is drawn in.

Despite the infancy of the technology, its impact is already evident. In a promotional video for the Google Pixel 6 Pro, production designer Hannah Beachler (working on Black Panther 2) talks through her creative process, which involves building up huge collections of images for moods, tones and storytelling.

Beachler uses the new Pixel 6 Pro, and particularly the ‘magic eraser’ feature (similar to that shown in Figure 3) that dramatically accelerates her process not only for collecting, but also testing how scenes, styles and settings might work. It is a slick corporate video, part of a marketing campaign for the new phone, so inevitably shows the technology in its best light. Yet, still, it is an example of how the new image diffusion model technology has moved quickly into ready-to-use devices and apps. It presents the possibility of accelerating creative practices, and arguably presents a new mode of creativity. As expected of a promotional video, the case is made not for automating creativity, but providing a new set of tools that allow (human) creatives to work quickly, affordably, and intuitively.

At the time of writing, Adobe announced significant updates to Photoshop, including a beta feature called ‘Backdrop neural filter’, which, like DALL-E 2 and other models, enables you to ‘create a unique backdrop based on a description’. The annual Adobe conference focused on AI for the social good and from a creator-centric perspective. Chief product officer, Scott Belsky, pointedly described AI only as ‘your co-pilot in creative endeavours’. Nevertheless, in the same way that digital HD video cameras (and later cameras on phones) led to a reduction in journalistic teams (and the rise of the individual Vlogger), a case can be made for new AI technology changing studio processes. A generous reading is that the creative potential of image diffusion presents the possibility of new modalities of creative practices and outputs.

While DALL-E 2 and IMAGEN have been used so far to generate still images, moving image models are quickly emerging. Apple’s GAUDI model, for example, turns text prompts into 3D scenes.

It is in the domain of moving image that perhaps rapid take-up and innovation will be most readily noticeable. 3D and moving image models have far-reaching ramifications for how the creative sector might look again at the implementation and marketing of existing technologies such as VR and AR, which to date have typically required high-level (and costly) digital renderings. The new image diffusion models are set not only to provide powerful, efficient and swift tools for visual media producers, but equally present the possibility for a new wave of user-generated content. So, for example, in the same way that the virtual space of Minecraft prompted a massive user community (and a number of prominent online player-celebrities), there is a strong case for image diffusion models (which as noted only need simple natural language prompts) enabling a rapid rise in new types of user-generated content. To put it simply, the technology is not far off from the ‘Holodeck’ of the television franchise Star Trek. The serious point is that a shift is occurring in how digital media is not only coded and copied, but also now rendered, prompting serious considerations for the knowledge economy and innovation cycles.

The Fate of Innovation

AI models can be extremely swift and efficient to use as an end-user. When typing, the auto-correct feature seems effortless; when unlocking a phone via facial recognition, the process takes seconds; even the rendering of entirely new images via DALL-E 2 takes less than 15 seconds. Yet, each of these technologies must be ‘trained’ (albeit using self-supervised methods), which is where the massive capabilities of supercomputers are required, and which remain outside of the means of individuals and indeed most businesses. The sheer scale of AI computing and its necessary infrastructure, set against the fast, ubiquitous products and algorithms it can provide, presents a split view of what this can mean for capitalist modes of production and for the knowledge economy.

Paul Mason’s Postcapitalism and Shoshana Zuboff’s The Age of Surveillance Capitalism are each canny studies of current economics, which draw out two reinforcing principles of info-capitalism. Firstly, information bears near-zero production costs. Secondly, this in turn requires a shift away from the search for competitive advantage (fuelled by innovation) towards the need to enforce advantage through legal means, ownership rights, etc. (which leads to monopolies).

Mason paints a vivid picture of the contemporary knowledge economy with the example of the jet engine. The engine was initially developed during the Second World War, and its first fifty years saw efficiency increases of only half a percent per year, yet more recently this has leapt to 65% efficiency. The change has come with new aircraft now being designed and tested on supercomputers. The difference this makes is revealed with a specific detail: the tail fin of the Tornado fighter jet, based on old-fashioned paper-based calculations, involved 12 stress tests. Its replacement, the Typhoon fighter jet, involved 186 million stress tests. The advances in technology are exponential, yet there arises a critical problem: once these new inventions and advances are made they become mere information, because they can be replicated and shared infinitely via digital means. They become literally price-less (we encounter an economy not of scarcity, but abundance). As significant and qualitatively different as the shift from merchant to industrial capitalism, the emergence of info-capitalism (which relates to other terms such as knowledge economy, information society, cognitive capitalism) poses critical questions for how society will change as a result.

As Peter Drucker put it in Post-Capitalist Society:

[That] knowledge has become the resource, rather than a resource, is what makes our society “post-capitalist”. It changes, and fundamentally, the structure of society. It creates new social dynamics. It creates new economic dynamics. It creates new politics.

Paul Mason updates Drucker as follows:

Information is not some random technology that just came along and can be left behind like the steam engine. It invests all future innovation with the zero-price dynamic: biotech, space travel, brain reconfiguration or nanotechnology, and things we cannot even imagine.

The reference here to ‘things we cannot even imagine’ is highly pertinent to the domain of AI-powered innovation. Only this year, DeepMind announced that its open-source AlphaFold AI system, used to predict the 3D structures of proteins, has increased its predicted structures for plants, bacteria, animals, and other organisms 200-fold to over 200 million structures. Such rates of innovation and the scale of new knowledge (both the quantity of new knowledge, but also the qualitative means, i.e. ‘thinking’ at a scale far beyond human capacities) suggest a fundamental change in the present-day knowledge economy (as Drucker and others first foresaw).

In his last book, Novacene, James Lovelock describes AlphaZero as ‘at least 400 times as quick as a human’. Or rather, it is a lot faster because it not only learns but attains ‘superhuman’ capability: ‘That means we don’t even know exactly how much better it is [at playing a game such as chess] … because there are no humans it can compete with’. But we do know, Lovelock writes, that a machine could be 1 million times faster:

…simply because the maximum rate of transmission of a signal along an electronic conductor … is 30 centimeters per nanosecond, compared with a maximum nervous conduction along a neuron of 30 centimeters per millisecond (a millisecond is 1 million times longer than a nanosecond).
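The factor of one million follows from the units alone, as a quick check shows. This is a minimal sketch using only the two figures in the quotation; the variable names are illustrative.

```python
# Convert both quoted speeds to centimetres per second using integers,
# so the ratio is exact.
NANOSECONDS_PER_SECOND = 10**9
MILLISECONDS_PER_SECOND = 10**3

conductor_cm_per_s = 30 * NANOSECONDS_PER_SECOND  # 30 cm per nanosecond
neuron_cm_per_s = 30 * MILLISECONDS_PER_SECOND    # 30 cm per millisecond

speed_ratio = conductor_cm_per_s // neuron_cm_per_s
print(speed_ratio)  # 1000000
```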

These kinds of numbers and the predicted exponential increases (based on the fact that ‘superhuman’ computing power will unlock further advances in AI) lead Lovelock to prophesy the emergence of ‘electronic life’ as a new species. While this might sound far-fetched (for now), certainly Drucker’s remarks on ‘post-capitalism’ retain currency.

The economic base has changed, and is changing, which has an impact on the ‘what, where and who’ of innovation. Crucially, the nature of information as the underlying principle of the knowledge economy presents a significant challenge to the capitalist mode of production. AI presents a further twist. When Karl Marx wrote of automatic systems as ‘the most perfected and most fitting form of the machine’, he could never have envisaged how computing and AI would take shape in the early to mid-twentieth century, and certainly he could not have understood how his notion of a ‘general intellect’ as a ‘general social knowledge’ could be encapsulated and indeed further enhanced by AI and big data. It is our contemporary (artificial) general intellect that now takes us beyond debates of the formation of economy (which in themselves are significant). Not only are critical questions of post-capitalism worth exploring, but so are the prospects of a post-knowledge condition: another ‘great transformation’ (on the scale Karl Polanyi considered with the emergence of the free market), only in this case it is not just the effect of knowledge upon economic life that is at stake; it is the basis of knowledge itself that is transforming.

Post-Knowledge and Prompt Engineers

In advancing the case for a post-knowledge economy, it is necessary to outline and question what AI technologies are doing. AI provokes a good deal of anxiety and incredulity. Yet, essentially, the underlying conditions are quite simple: current AI modelling involves statistical processes trained upon massive data sets, combined with supercomputing (enabling multi-dimensional calculations, beyond anything a human could achieve). For all the scale and power required, there is a simple, elegant principle at stake: probability (as derived from the work of Norbert Wiener and Claude Shannon in the 1940s). The mathematical account of all information as a matter of probability, whereby we can seek to reduce probabilities to their simplest form of on/off, ones and zeros, gave rise, for example, to the pixel and bit as the smallest units of meaning. It is this account of information (originally devised to counteract the problems of ‘noise’ in communication processes and the use of early transistors) that paved the way for digital technology, which in turn gave rise to big data, which in turn has proven proficient for ‘brute force’ statistical methods in AI.

These are the conditions required of image diffusion, with AI trained on massive data sets of both images and language (and their relationality), and adopting, as outlined, a novel technique to first reduce the information to noise, before then reconstituting a new picture as a ‘probable’ outcome of the requested input. What is clever about the approach is that, unlike AI language translation, which seeks to predict the most likely (stable) translation of a word (i.e. a regular, ‘correct’ translation), image diffusion models are ‘predicting’ the most likely new image we will accept (or recognise) based upon the requested information. This is where image diffusion represents a significant innovation in AI image processing, as it is a form of creative output. The necessary ‘units of analysis’ (the means to work through datasets using self-supervised learning) are easy enough to understand. AI processing is not ‘looking’ at pictures or ‘reading’ words when it learns; instead, it is searching through the inner properties of images and text, giving values to the smallest units possible. In images this is the pixel, and indeed not simply individual pixels, but the relationality of pixels, i.e. how specific values of pixels (colours, density, intensity) tend to cluster as groups of pixels (as small elements within a large picture) according to larger ‘architectures’ of composition, and as set against language-based values associated with these compositions. By using massive archives of image and text as training data, supercomputers are able to chart through elaborate regularities of patterns, the formulations of which then accrue within AI neural networks, which, following the self-supervised training, can be deployed as standalone interfaces.

Given the scaling achieved within modern supercomputing, it is fair to consider that AI can offer a form of ‘reasoning’. Taking their cue from the philosopher Ludwig Wittgenstein, Steven Piantadosi and Felix Hill argue this is beginning to happen with large language models, whereby conceptual meanings, while not derived from direct references, emerge through internal reasoning, due to the way concepts in language usage ‘relate to each other’. The ramifications of Piantadosi and Hill’s account are significant. AI as a reasoning machine, as it grows in sophistication and pervasiveness, opens up the prospect of a new foundation for the economy, one that goes beyond the knowledge economy.

Might we describe image diffusion as a form of reasoning? This is different to arguing a model is ‘thinking’ or is ‘sentient’. Rather, the image diffusion model is based on a probabilistic methodology that can make reasoned decisions to produce new images. Nonetheless, the biases at play (such as presenting chess players as all men) reveal problems in the data and lend credence to the criticism that AI merely imitates or parrots all that has been previously said, thought and described. It is only as good as the human record upon which it draws (the massive, but partial archive of images and texts that self-supervised AI can trawl through). Yet, we seem to be at a pivotal moment, with AI poised to take us into a post-knowledge economy. The weak form of this scenario will involve increasing adoption of AI models and tools, but only persisting with the parroting of all that is already archived, without necessarily contributing anything new. It is a scenario of a ‘cut and paste’ culture (albeit cleverly rendered), taking us back to the postmodern debates of simulation (as in the film The Matrix). Arguably, this is the current position, with the human increasingly stepping aside to allow virtuoso AI systems to entertain us, to do our work, and emulate pseudo-innovation; while we make only peripheral, short-lived gains in the margins. In this scenario we become merely the prompt engineers, on the lookout for just that little bit more we can squeeze out of the system. The ‘role’ of prompt engineer has already gained some currency, albeit at an informal level (e.g. dallery.gallery), but already this is a domain that AI itself is entering, with tools such as Phraser emerging as meta-tools using machine learning to help users write prompts for neural networks.

A stronger form is that of a truly post-knowledge economy, which takes the general intellect to its next logical stage. In this case, there is a generative account of information (such as the pictures churning out of DALL-E 2 and other such models), which acts as a source of new creative inputs in a site of exchange that is no longer our preserve alone.

Future Trajectories

To end on a pragmatic note, it is perhaps feasible to sketch mid- and long-term trajectories. In the mid-term, it is only a matter of time before the creative sector reorganises itself around AI technologies (as suggested already with Adobe’s new Photoshop features). This echoes the fundamental change that took place with the maturing of digital (and networked) production, which dramatically altered the film, television and music industries, and of course gave rise to a fourth domain with the gaming industry (and the evident synergies between these areas). But, as set out here with the specific example of image diffusion, there is something qualitatively new about recent developments in AI methods and models. In the mid-term, we are likely to see the ubiquitous take-up of accessible and generally applicable AI tools. The growing success of the underlying large language models as the interface for these tools allows for many more users and businesses to participate. In turn, this will likely have two immediate effects. Firstly, in the same way social media ushered in previously unheard-of job types and responsibilities, ubiquitous AI will open up the need for new job descriptions and skill sets. We will see, for example, the professionalisation of prompt engineers across a wide range of sectors. Secondly, there is likely to be a further shift away from technical skills to creative roles (with growing numbers able to create previously complex materials), which in turn will allow for further expansion of the prosumer.

In the longer term, the burgeoning of AI as off-the-shelf technology will intensify ongoing debates around ownership, intellectual property, and infrastructural maintenance. Such arrangements are unlikely to alter the current dynamic of a narrow concentration of ownership for the development and platforming of AI. Yet, the deeper question of the very basis of a knowledge economy will inevitably surface more viscerally. Today, we can watch as AI tools generate and manipulate images within a matter of seconds; performing activities that humans might take days to complete (and even when using software such as Photoshop would require copious time and high level skills that many simply do not possess). AI tools are only going to get more sophisticated, faster and offer greater interoperability. In doing so, the shift to a new economy seems undeniable. It will be an economy of abundance and of new modes of production, wherein we will see rapid, micro-based innovation cycles (a massive acceleration of anything Drucker previously predicted); it will almost certainly prompt a shift to ‘work’ (‘projects’ we wish to engage in) not ‘employment’ (to borrow Bernard Stiegler’s distinction, which recognises a collapse in the labour wage); and will heighten debates and practices of energy security and sustainability. But, as outlined, it will also be the harbinger of a new kind of knowledge, one that might not be the sole preserve of the human, and which represents a massive (and massively open) general intellect.
