00:00:00Good to see so many here. As I already said in the intro, I'm going to give you a dive
00:00:14into Flux, our model family for generating images and editing images. I was already - is
00:00:22it working? I'm Andy, co-founder of Black Forest Labs. Before I start with the model,
00:00:36I want to give you an overview of what we're doing. At Black Forest Labs, we believe that
00:00:42visual media will become the central interface for human communication in the future. We see
00:00:48ourselves as the central infrastructure provider to power all the images and the videos that
00:00:54humans will use to interact with each other, and not only what cameras can capture, but
00:01:01even way beyond that. With that in mind, we started the company in August 2024. Since then,
00:01:08we've grown it to 45 employees, and we are distributed amongst two headquarters. The main
00:01:15headquarters is in Freiburg in the Black Forest in Germany, and we also have an office here
00:01:20in SF. Since we released our image generation family, Flux, in August 2024 when we started
00:01:30the company, we've always structured releases in three different tiers, and we have constantly
00:01:38advanced the model family. The tiers are as follows. We have the Pro models. They are super
00:01:44powerful and the fastest models that we offer. They are available via the BFL API only and
00:01:50also via a couple of inference partners like, for instance, Fal and Replicate. I guess you
00:01:54also know them. They are super easy to integrate and scale to massive volumes nearly instantly.
00:02:03This is the first tier, but as some of you might know, my co-founders and I have very
00:02:10strong roots in open source, similar to, I think, the founder who has invited us today. We are
00:02:19also the original developers behind Stable Diffusion. We still stick to that. We love
00:02:25the open source community, and that's why we also offer open weights and open source models.
00:02:29We have the Flux Dev models. These are publicly available for downloading, for tinkering. They
00:02:36are fully customizable and they offer a lot of flexibility for everyone who wants to use
00:02:40them. Finally, we have the Flux Schnell models. They are fully open source and they are, in
00:02:48a way, the perfect entry point into the Flux ecosystem. Speaking of the ecosystem, if you
00:02:56look at the Model Atlas on Hugging Face, which visualizes the, I think, most used open source
00:03:03foundation models across domains, we can actually see that the single largest model on Hugging
00:03:13Face that has the largest ecosystem attached to it is our Flux Dev model. That pretty much
00:03:21shows that Flux has already become the standard for open image generation. Obviously, we are
00:03:27looking to advance and extend our distribution even further in the future. So much for the
00:03:34company. Let me see if it's still not working. Anyway. Now for the main part of the talk.
00:03:41I wanted to dive into Flux with you, especially into our most recent model, Flux Kontext, which
00:03:48unifies text-to-image generation and editing. Today I want to talk about how to unify these.
00:03:56A couple of words before that. I think it's super important to have this joint model because
00:04:02obviously image generation has a lot of nice applications and we've seen this in the past
00:04:07year, but image editing had, until really this year, not kept up with the same speed of
00:04:14development. Image editing is actually a super important use case. It allows us to iterate
00:04:20over existing images and gives people, I think, an additional level of control to
00:04:25actually precisely modify images. This is super important. With Flux Kontext,
00:04:35we've created the defining moment for image editing. It was released in June 2025. It's
00:04:43a model that combines image generation with editing capabilities like character consistency,
00:04:50style reference, local editing and all that at near real-time speed. We'll see this later.
00:04:57But as a good example, I brought you this image row here. From left to right, we start with
00:05:01an input image. Then we can prompt the model to remove this object from her face and then
00:05:06we can set her into a completely new context while keeping the character consistent. This
00:05:13is super important. There was a bunch of work done in the past fine-tuning to actually get
00:05:20this kind of character consistency into the model based on publicly available text-to-image
00:05:25models, but this instant image editing allows us to remove all that fine-tuning,
00:05:31which was always a bit laborious, I would say. This is actually super amazing that this takes
00:05:37now four seconds or something. Finally, we can just change the scenery. In this case,
00:05:43the rightmost image, we change it to a winter scene. Cool. Here are a couple of more examples
00:05:49of what it can also do. It's not only good for character-consistent edits, but
00:05:55it's also super nice for style transfer. We see that on the left side. We take the style
00:05:59from the input image and map it to a new content or we can do things like text editing, just
00:06:06changing the Montreal to Freiburg while keeping the font consistent. This is all combined in
00:06:12one model, and you can interact with it via a super simple text interface. Cool. Very
00:06:19importantly, this is not only a general model, but it's also very good at solving
00:06:26specific important and interesting business problems. For instance, here in the left example,
00:06:33we can extract this skirt here from an in-the-wild image and we get a product shot of this thing
00:06:40and a zoom-in nearly instantly - again, in a matter of seconds. Before these editing
00:06:45models, this took hours or days, or was not even possible. Similarly, on the right side here, we can get
00:06:53from a sketch to a completely rendered output in a couple of seconds. Cool. As I already
00:07:01mentioned, Flux Kontext combines text-to-image and image editing. We just saw a couple of
00:07:07examples. Let's briefly look at what this actually means in terms of the model pipeline.
00:07:12Here we see the classic text-to-image pipeline. Pretty simple. We all know it. We
00:07:17use a text prompt. We push it through the model. The model then does some magic. I'll explain
00:07:21to you how to create such a model in a second. Then we get out an image that hopefully, if
00:07:28the model is good, follows our input text prompt. If you look at image editing, it looks
00:07:34quite a bit different. We start with an image, which we show the model in a way, and then
00:07:41we don't add a text instruction that describes an entire scene, but only a change to that
00:07:46image. Here we have two conditionings, so we have more inputs; in the first
00:07:50example, we only had one input. Now we describe a change, and the model should then modify the
00:07:56image according to the change. Some parts, as the church here, should be the same after
00:08:05the edit. Others not. This is what these editing models do. It's quite a different task. Combining
00:08:12this into a single model is actually super nice because you can do everything. You can
00:08:18generate an image, then edit it afterwards, and get a lot more flexibility in a way. I
00:08:26already mentioned that before we released these editing models, or before we saw these general
00:08:31editing models, there was a bunch of work done on fine-tuning text-to-image models to get
00:08:36this kind of level of control into the model. But this is now not needed anymore. We can
00:08:42just do this instantly. This just brings down the time that you need to get nice results
00:08:48significantly. So this is it in terms of the pipeline. Now, let's look at how can we actually
00:08:57train these models. And there's a very important algorithm that I want to talk about. The algorithm
00:09:12that enables us to train these models is called Latent Flow Matching, which is composed of
00:09:17two aspects, Latent and Flow Matching, and I want to shed a bit of light on both of those.
00:09:24Let's start with the Latent. This comes from latent generative modelling. This is an algorithm
00:09:29that my co-founders and I came up with nearly five years ago. To explain what this means,
00:09:35let's first look at the following example. What I here visualise is basically two images,
00:09:41and for us, they look just the same. The left one is a JPEG, and the right one is the same
00:09:47image as a PNG. So the left one is an approximation of the right one, but we don't see any difference.
00:09:53Or is there anyone who sees a difference in these two images? I don't think so. Okay, now
00:09:59let's look at the file size of these images. The file size of the JPEG is actually close
00:10:06to an order of magnitude smaller than the file size of the PNG. This is quite remarkable,
00:10:13and we all know how image compression works, but just realising that we can remove apparently
00:10:19a lot of information from an image without noticing it is quite remarkable, I would say.
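To make this point concrete, here is a tiny Python sketch of the same effect: a crude stand-in for lossy compression that simply drops the low bits of each pixel (detail we barely perceive) before compressing losslessly. This is not JPEG - the gradient test image, the 4-bit quantisation, and the zlib back end are all illustrative choices - but it shows the same size/fidelity trade-off:

```python
import zlib
import numpy as np

# Crude stand-in for lossy image compression: drop the low bits of each
# pixel (detail we barely perceive), then compress losslessly with zlib.
# This is NOT JPEG -- just an illustration of the size/fidelity trade-off.
rng = np.random.default_rng(0)
x = np.linspace(0, 255, 256)
img = (x[None, :] + x[:, None]) / 2                  # smooth 256x256 gradient
img = (img + rng.normal(0, 2, img.shape)).clip(0, 255).astype(np.uint8)

lossless = zlib.compress(img.tobytes(), 9)           # keep every bit

quantized = (img >> 4) << 4                          # keep only the top 4 bits
lossy = zlib.compress(quantized.tobytes(), 9)        # much smaller "file"

# No pixel moved by more than 15/255, yet the size drops dramatically.
max_err = int(np.abs(img.astype(int) - quantized.astype(int)).max())
```

The noisy low bits are nearly incompressible, so discarding them shrinks the compressed size by far more than the visual change would suggest.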
00:10:26So apparently there's a lot of information in an image that we cannot perceive with our
00:10:30human eye. Another way to visualise this is to plot the perceptual similarity between an image
00:10:39and its approximation - in the last example, the PNG is the image and the JPEG is the
00:10:44approximation - against the file size. When doing this, we get this plot. This is a conceptual plot,
00:10:56so this is not real, but it looks conceptually like this. The perceptual similarity quickly
00:11:03increases and then stays at a constant level for nearly the entire range of file sizes. This is what
00:11:11lossy compression algorithms like JPEG make use of, and you might ask now what does this
00:11:18have to do with generative modelling? It shows us that for a perceptual signal, or a natural
00:11:25signal like an image - for audio it is actually the same - to look real, or to be perceived
00:11:32as real, we don't need to model all the high-frequency details that we cannot perceive. Hence,
00:11:39training a generative model in the pixel space on all these high-frequency details would actually
00:11:44be a great waste of compute and time, because the model would learn to represent aspects
00:11:50that we don't even perceive, so it's pointless to learn this, right? And that is at the core
00:11:57of latent generative modelling. So instead of training a generative model in the pixel
00:12:01space directly on images, we learn a compression model that extracts a lower-dimensional so-called
00:12:09latent space. This latent space is what we see here in the centre. Let's see if the laser
00:12:15pointer works. Oh, yes, so this guy. How do we learn this model? It's actually super simple.
00:12:24We use an image here on the left. We push it through an encoder, so effectively this
00:12:29is an autoencoder, we push the image through the encoder, then we arrive at this latent
00:12:34space, and the representation we then push through an operation that is called regularisation.
00:12:42This forces the model to remove information from this latent representation. It can be
00:12:48implemented either discretely or continuously, and then again we reconstruct the image from
00:12:56this latent representation. So it's a classical autoencoder, which we train to basically yield
00:13:04reconstructions similar to the input, and, very importantly, we add this discriminator loss. This can be
00:13:11imagined as a prior to make sure that actually only the details that perceptually matter to
00:13:19our human eyes are reflected in this latent representation. Again, this regularisation
00:13:24forces the model to reduce or to remove information, and the discriminator makes sure that it removes
00:13:32the right information that we cannot perceive. Like this, we arrive, once we have trained
00:13:36this model, at this latent space that is then used to train the generative model on. The latent
00:13:44space is a lower-dimensional representation of the input image that is perceptually
00:13:49equivalent to it. This is basically the latent aspect of the latent flow-matching algorithm. Let's
00:13:57talk about the second, flow-matching. Again, everything I explain right now happens in this
00:14:04latent space. So whatever we do right now, you see it here. On the left side, every image
00:14:15gets embedded into that latent space, basically. So, yes, let's talk about flow-matching. Flow-matching
00:14:22algorithms are a general family of algorithms that are used to translate from a very simple
00:14:31distribution, which is, in our case, always the standard normal distribution, so we're
00:14:35now talking about probability distributions. I visualised it here. This is a very simple
00:14:40distribution here. Flow-matching algorithms translate this or provide us with means to
00:14:47train a vector field that is represented by a neural network, that guy here, to map between
00:14:53the simple distribution and very complicated distributions, such as the data distribution
00:14:59of natural images. So this is the data distribution. What do we do to train this? The flow-matching
00:15:08algorithm provides us with a very simple means to do this. All we have to do during the training
00:15:15is to draw a sample from this standard normal distribution here. So we have a sample, and
00:15:21then we assign it to one sample from the data distribution, a training example, and we couple
00:15:27this, and then we can construct this kind of vector that directly, linearly connects them.
00:15:34If we do this for every example in our training dataset - we take the example, we randomly
00:15:40sample a point from the standard normal, and we connect them - then we arrive at this kind
00:15:45of constructed vector field. I could now talk a lot about the properties of vector
00:15:55fields. One important property is that paths cannot cross in vector fields, and we see that
00:16:00there's a lot of crossing going on, so this is obviously not the true vector field that
00:16:05translates between every point on this distribution, or between this distribution and that one.
00:16:13The amazing thing about flow-matching is: if we just follow this rule - training the
00:16:17model to basically always predict these kinds of vectors between the data sample and the
00:16:22sample from the standard normal distribution - we arrive at the true vector field, and that
00:16:31then looks like this. Here we see that paths do not cross anymore, and the flow-matching
00:16:38algorithm just guarantees this. This is a bit like magic, but if you write it down mathematically,
00:16:43we actually see that this makes sense. And like this, we can actually then train the model
00:16:49to represent this true vector field that translates between the standard normal and our data distribution.
00:16:59And importantly, we want to be able to create images based on text inputs, so what we do
00:17:07is we condition this network always on a text input basically, for every image example. Cool.
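The training rule just described can be sketched in a few lines. Everything here is a didactic toy - a 2-D stand-in for the latent space and a small linear map standing in for the transformer - and the text conditioning is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy linear map standing in for the transformer: it takes the noisy
# sample plus the time and predicts a velocity.
theta = rng.normal(size=(2, 3)) * 0.1

def model(x_t, t, theta):
    return theta @ np.concatenate([x_t, [t]])        # (2,3) @ (3,) -> (2,)

x1 = rng.normal(loc=3.0, size=2)                     # a data sample (a latent)
x0 = rng.normal(size=2)                              # a sample from N(0, I)
t = rng.uniform()                                    # random time in [0, 1]

x_t = (1 - t) * x0 + t * x1                          # point on the straight path
v_target = x1 - x0                                   # velocity of that path

# Flow-matching loss: regress the prediction onto the straight-path velocity.
loss = np.mean((model(x_t, t, theta) - v_target) ** 2)
```

Repeating this over many random (noise, data, time) triples is what makes the crossing straight-line targets average out into the true, non-crossing vector field.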
00:17:17So what are we doing when we're then sampling the model? We have this vector field that represents
00:17:22the mapping between those two distributions. What we do is then we start with a sample from
00:17:27the standard normal. We can sample from it with a computer, right? We all know that. And
00:17:33then we integrate along these trajectories represented by the neural network. We can do this
00:17:39with a simple Euler-forward algorithm. Probably a lot of you know it. So with a numerical
00:17:45integration scheme, we can just integrate along these trajectories here and then arrive at
00:17:51the data sample. We push it through the decoder again - so again, this happens
00:17:56in latent space, but here we then arrive in the pixel space again. And this is how we
00:18:02can create images based on a text prompt. Cool. One thing: these numerical integration schemes
00:18:11use a lot of steps - they break this process down
00:18:20step by step into up to 50 steps. So these latent flow matching models are natively pretty
00:18:26slow, and it takes about 30 seconds to one minute to generate an image, which is a bit long.
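The sampling loop just described can be sketched like this. The learned velocity field is replaced by the exact constant field for a single (noise, data) pair, so this toy is not a trained network, but the Euler-forward integration has the same shape:

```python
import numpy as np

x0 = np.array([0.0, 1.0])                # the draw from N(0, I) (fixed here)
x1 = np.array([3.0, -2.0])               # the data sample we should reach

def velocity(x, t):
    # Exact field for this single pair: constant along the straight path.
    return x1 - x0

steps = 50                               # standard models use up to ~50 steps
dt = 1.0 / steps
x, t = x0.copy(), 0.0
for _ in range(steps):
    x = x + dt * velocity(x, t)          # one Euler step = one forward pass
    t += dt
# x now (numerically) equals x1; decoding x would give the final image.
```

Each Euler step costs a full forward pass through the network, which is exactly why reducing the step count (as discussed next) matters so much for speed.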
00:18:32I'll talk about how to make them fast very soon. But this is the general latent flow
00:18:37matching algorithm. So 'latent' again stands for
00:18:42the latent space in which we train the model. And the flow matching algorithm is what
00:18:46we just discussed here. Okay, now I explained how we create images based on text prompts,
00:18:54but how does this now apply to Kontext, which is an editing model, right? This is also super
00:19:00simple. So this is the basic Flux Kontext architecture. It's a transformer model. We all know that.
00:19:05It's a bit special, but the magic lies in the input. So we see here on the left side
00:19:12the input into the model. First we have the text input that just gets embedded by a text
00:19:18encoder into a set of text tokens. And then we have the image encoder we already saw in
00:19:25the last slide here, right? This guy here. This is what we now see here. So we have this
00:19:37image encoder and there we have two sets of visual tokens. First we have the set of the
00:19:44visual tokens that we actually use to generate. This is what will be the output image. And
00:19:48then we have, if we want to do image editing, a second set of visual tokens that just model
00:19:55or that just represent the context image. So the reference image basically that I'm showing
00:20:01the model. And what we then do is we push this to the transformer model. It's a special one
00:20:05because it contains so-called double-stream blocks. These are, I would say, kind of expert
00:20:12models for each modality. So here we handle the visual tokens and the text tokens separately.
00:20:20For everything except for the attention operation, the attention operation then happens jointly
00:20:26over all tokens. And then we have standard blocks - standard transformer blocks - where
00:20:31we basically map the text tokens and the visual tokens with the same
00:20:38mappings before the attention operation. And like this, we can just do image editing.
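The input construction just described can be sketched as follows. The dimensions and token counts are made up for illustration; the point is that editing simply appends a second set of visual tokens for the context image, while text-to-image omits it:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # toy model width (illustrative)

text_tokens = rng.normal(size=(8, d))    # embedded text prompt
target_tokens = rng.normal(size=(64, d)) # latent tokens being generated

def build_sequence(text, target, context=None):
    """Concatenate the token streams; attention later runs jointly over
    all of them. Editing supplies `context` (the encoded reference image);
    pure text-to-image simply omits it."""
    streams = [text, target]
    if context is not None:
        streams.append(context)
    return np.concatenate(streams, axis=0)

t2i_seq = build_sequence(text_tokens, target_tokens)          # generation

context_tokens = rng.normal(size=(64, d))                     # encoded input image
edit_seq = build_sequence(text_tokens, target_tokens, context_tokens)  # editing
```

Because the joint attention runs over whatever sequence arrives, the same trained model handles both cases without any architectural switch.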
00:20:48For image editing, you provide an input image here; if you do text-to-image generation, you just don't provide
00:20:52it, and then we have only a text prompt as input, right? Cool. Last point here. How is
00:21:01the model so fast? So I don't know how many of you know flux models. Maybe can you just
00:21:08raise your hand if you know Flux models? Oh, actually quite a couple of you. Okay, cool. So we all
00:21:12know that they are pretty fast, right? What do I mean when I say fast? We are basically
00:21:19most often orders of magnitude faster than comparable models. So here, for instance, we
00:21:24look at an obviously very slow but nice model, GPT Image 1. Also here for editing,
00:21:32the Flux models are more than 10 times faster, even more than, yeah, 20 times. So
00:21:39it's actually insane how fast they are compared to comparably powerful models. And the reason
00:21:47for that is an algorithm we developed two years, three years ago. It's called adversarial diffusion
00:21:54distillation and the goal of this algorithm is to bring down the number of numerical integration
00:22:01steps. I told you earlier these are like most often 50 for a standard flow matching model
00:22:09and the goal here is to bring them down to as little as four. Each numerical integration
00:22:16step means a forward pass through the neural network so we can imagine that this just takes
00:22:20a long time so we want to reduce it as much as possible. How does it work? We initialize
00:22:27two networks here, a teacher and a student. Both of them are initialized from the learned
00:22:34flow matching model via the algorithm I just showed you. And what we then do is to train
00:22:40the student to get the same image quality in the output in four steps as the teacher does
00:22:48in 50 steps. This is the goal and this is how we do it. We start with an image, we encode
00:22:53it again to a latent here, and then we generate an output image with the student in four steps
00:23:03or in the number of target steps that we want to do. And then we decode it again to pixels.
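The step-count asymmetry at the heart of this setup can be sketched like this. The networks are replaced by a closed-form toy field, so student and teacher coincide exactly here - in reality, the 4-step student starts out blurry and is trained to close the gap:

```python
import numpy as np

x0 = np.array([0.5, -1.0])               # shared starting noise (a latent)
x1 = np.array([2.0, 3.0])                # the sample both should produce

def sample(steps):
    """Euler sampling with a constant toy field (stands in for a network)."""
    x, dt = x0.copy(), 1.0 / steps
    for _ in range(steps):
        x = x + dt * (x1 - x0)           # each step = one forward pass
    return x

student_out = sample(steps=4)            # fast student: 4 forward passes
teacher_out = sample(steps=50)           # slow teacher: 50 forward passes

# The distillation loss pulls the student's output toward the teacher's;
# in this toy both coincide, so the loss is (numerically) zero.
distill_loss = np.mean((student_out - teacher_out) ** 2)
```

With real networks the few-step trajectory diverges from the many-step one, and the distillation and adversarial losses described in the talk are what force them back together.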
00:23:08In the beginning, this image here looks very blurry and just not realistic. And
00:23:15the goal is obviously to improve this. So what we are doing is to take this again, encode
00:23:19it again to a latent, and then do the same thing with the teacher, but in 50 steps instead of
00:23:26four steps. This then results in a high quality image and we then use this distillation loss,
00:23:33basically just a loss to ensure that the output distribution of the student matches that
00:23:41of the teacher. This alone would unfortunately not allow us to basically generate images that
00:23:47are looking real. So what we add is another discriminator loss. We saw this already for
00:23:53the autoencoder part in the latent generative modeling part of the talk earlier. This is
00:24:01basically the same. So we train a discriminator to tell generated images from the student from
00:24:08real images that we input here. And this happens in a DINOv2 feature space - a learned
00:24:16image representation space, in a way. And like this, we can actually then train the
00:24:21model to, in the end, generate realistic images; instead of using 50 steps, it just uses four
00:24:29steps. That's obviously a huge speedup. However, last point here. If we look at this thing here,
00:24:37there is, I would say, a lot of overhead here, right? Because here we have to encode
00:24:43to the latent space. So we start in the image space, encode into the latent space, and we decode
00:24:48again; then we have to encode again and decode again. And then this one is also encoded
00:24:53again into another representation space. A lot of overhead, and a lot of memory costs related
00:24:59to this. When we came up with it, we were amazed by
00:25:03it because it allowed us to train fast models, but it was so laborious to train. So we thought
00:25:09about, okay, how can we actually simplify this? And the answer is, as always: just move
00:25:16it to the latent space whenever you have pixels. So what we did was come up with a latent
00:25:22adversarial diffusion distillation approach. It's basically very similar to what we did
00:25:27for the general latent generative modeling algorithm. We just move everything here to
00:25:31the latent space. Same thing, but instead of having to use these encoders and decoders,
00:25:38we can just get rid of those. And importantly, as a discriminator, we don't use DINO,
00:25:44this image representation model, anymore. We use the teacher, because that one already lives
00:25:50in the latent space anyway and provides us with a very nice image representation. So we can also use
00:25:55the teacher as the discriminator. And the rest of it is basically nearly the same. We
00:26:02also remove the distillation loss. We found that we don't need it, which is also cool.
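A minimal sketch of the discriminator side of this latent setup, with a frozen random map standing in for the teacher's feature extractor and hinge-style adversarial losses - the shapes, the random features, and the hinge form are illustrative assumptions, not the exact training objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen random features standing in for the teacher network, which in
# this latent setup doubles as the discriminator's feature extractor.
W_teacher = rng.normal(size=(8, 8))

def teacher_features(z):
    return np.tanh(z @ W_teacher)

real_latents = rng.normal(size=(32, 8))          # encoded real images
student_latents = 2.0 * rng.normal(size=(32, 8)) # student outputs (off-distribution)

# A linear discriminator head on top of the teacher features, with
# hinge-style adversarial losses (a common choice, used here for illustration).
w_head = rng.normal(size=(8,))

def disc(z):
    return teacher_features(z) @ w_head

d_loss = (np.mean(np.maximum(0.0, 1.0 - disc(real_latents)))
          + np.mean(np.maximum(0.0, 1.0 + disc(student_latents))))
g_loss = -np.mean(disc(student_latents))         # the student's adversarial loss
```

Everything stays in latent space: no pixel-space encode/decode round trips and no separate feature network, which is where the memory and compute savings come from.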
00:26:06So we have one loss fewer, and everything gets simplified. And like this, we can actually
00:26:13then, in a very memory-efficient way, also bring down the number of integration steps from 50
00:26:21to four. So we have a 12.5 times speedup, and that's actually what we see as this order
00:26:26of magnitude in the plots I just showed you in the beginning of this section. So that's
00:26:32basically how we get a very fast model from a flow matching, from a base flow matching
00:26:40model. And now before this talk ends, I actually have brought you a demo to show you flux a
00:26:47bit in action. Let's see. So let's use it for image editing here. Let me upload something
00:26:56after. What are we doing here? This one looks good. Yeah. Okay. Yep. This is good. So here
00:27:12I start with a logo of my favorite football club, the SC Freiburg soccer club. I have to
00:27:17say soccer when I'm in the US. Okay. This is my favorite club and I want to create a t-shirt
00:27:22with this logo. So let's say put this logo onto a t-shirt. Feels a bit weird because I
00:27:45don't have a screen in front of me. Okay. Here we go. Generating. Let me make this a bit smaller.
00:27:53Maybe like this. Okay. Nice. We wait for a couple of seconds and we can get this nice
00:28:00logo on a t-shirt. And now the nice thing is we can actually go ahead, right? We can iterate
00:28:06on this. So let's say this logo is a bit too large, I would say. Make the logo smaller and
00:28:16put it on the chest part. Again. Wait a couple of seconds. Okay. Cool. And we arrive at a
00:28:39result that is actually super nice. That is actually what I wanted. I want to start with
00:28:46this one again. And I want to now change the color because the color of the SC Freiburg
00:28:52is not black, it's red. So make the t-shirt red. Also super simple. Now we're at local
00:29:06editing. We're just editing local parts of the image, right? In this case the color. And
00:29:15importantly, we've now done a couple of edits and we see still that the logo is very consistently
00:29:21represented. So this is the character or in this case object consistency that we saw. This
00:29:26is super important. Think about a marketer who just has an object and wants to set it
00:29:32into a certain context, right? This is in terms of the business value, it's great, it's super
00:29:39important. And now finally we add a more complex transformation. We can say put the t-shirt
00:29:48onto a man walking in the park. Oops. So this is a complex transformation, and you could have
00:30:06said, okay, things like changing the color you can do in Photoshop, right? But historically,
00:30:12stuff like that is not what standard, earlier non-AI image editing tools used
00:30:21to be able to do. This is actually super nice. So here we now have this kind of
00:30:26man, and finally - I think I'm at time, but let's do one last thing which shows how general this
00:30:31model is. We can also do style transfer, right? So let's say make this a watercolor painting.
00:30:42All right, final one. And before these models, you would
00:30:54probably have trained a single fine-tune for each of these kinds of tasks, and now we can just
00:30:58combine it in one thing, which is pretty cool. Nice. So now I could print it out and hang
00:31:03it on my wall or something. Anyway, so yeah, I think this is showing the power of these
00:31:08models. Oh, that crashed something. I wanted to show you a last slide because I'm through,
00:31:17but we're hiring, and if you want to join us, please scan this here or visit the playground -
00:31:22the demo I just showed you is freely available. Thanks so much. I hope you learned something.