I Turned Cheap Cloud Storage Into a 1PB Local Drive (With JuiceFS)


Transcript

00:00:00This is JuiceFS. It's a high-performance, open-source, distributed file system designed to provide the infinite scalability of cloud object storage with the full functionality and speed of a local file system.

00:00:14In this video, we'll take a look at JuiceFS, see how it works, and I'll show you how you can set up your own local high-performance network-attached storage solution with JuiceFS.

00:00:24It's going to be a lot of fun, so let's dive into it.

00:00:30So standard object storage services like S3, Google Cloud Storage, or Backblaze B2 are incredibly cost-effective, but interacting with them typically requires specialized APIs or specialized tools that break traditional application workflows.

00:00:48JuiceFS acts as a transparent abstraction layer.

00:00:51It separates your data from your metadata, pushing raw data chunks directly to your cloud provider while managing the file system layout, permissions, and directory structures inside a fast database like Redis, Postgres, or TiKV.
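
The data/metadata split described above can be illustrated with a toy model. This is only a sketch: the two dictionaries stand in for the object store and the metadata engine, and the key scheme and function names are hypothetical, not JuiceFS's actual on-disk layout.

```python
# Toy model of data/metadata separation. Not JuiceFS's real format.
# `object_store` stands in for a bucket, `metadata` for a database like Redis.
BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, the block size mentioned later in the video


def write_file(path, data, object_store, metadata, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks; keep blocks and metadata apart."""
    keys = []
    for index, offset in enumerate(range(0, len(data), block_size)):
        key = f"{path}/block-{index}"  # hypothetical key scheme
        object_store[key] = data[offset:offset + block_size]
        keys.append(key)
    # Only the layout lives in the metadata engine, never the raw bytes.
    metadata[path] = {"size": len(data), "blocks": keys}


def read_file(path, object_store, metadata):
    """Look up the block list in metadata, then reassemble the file."""
    entry = metadata[path]
    return b"".join(object_store[key] for key in entry["blocks"])
```

Listing a directory or checking a file size in this model only touches `metadata`, which is why the metadata engine needs to be fast while the bucket only needs to be cheap.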

00:01:07What makes JuiceFS completely different from traditional cloud gateways or standard network file shares is this strict architectural separation and its aggressive multi-tiered caching engine.

00:01:19Instead of forcing your applications to wait on high-latency cloud network requests every time a file is accessed,

00:01:26JuiceFS breaks files into small, optimized blocks and utilizes a local NVMe or SSD partition as a hot scratch space.

00:01:35The first time an application reads or writes data, it communicates over the network, but the second time that data is requested, it serves instantly from local storage at hardware line speeds.
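
The read path described above is a read-through cache. A minimal sketch, assuming a hypothetical `fetch_remote` callable in place of the real object storage client and an in-memory dictionary in place of the NVMe scratch space:

```python
class ReadThroughCache:
    """Serve blocks locally when possible; otherwise fetch once and keep a copy."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # hypothetical: key -> bytes
        self._local = {}                   # stand-in for the local cache disk
        self.remote_reads = 0              # counts trips over the network

    def read(self, key):
        if key in self._local:
            # Warm read: no network involved.
            return self._local[key]
        # Cold read: go to the remote store, then remember the block.
        self.remote_reads += 1
        data = self._fetch_remote(key)
        self._local[key] = data
        return data
```

Reading the same key twice only increments `remote_reads` once, which is the cold-versus-warm behavior the benchmark later in the video demonstrates.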

00:01:47This allows legacy applications, databases, machine learning training pipelines, and container environments to run directly on top of object storage without rewriting a single line of code.

00:01:59So all that sounds great, but let's test it out for ourselves and see how it works.

00:02:04So for this demo, I'm going to set up a local network-attached storage, or NAS, which will host all my data in my own remote S3 bucket and use Redis as our metadata engine.

00:02:16The very first thing you want to do is spin up a Redis instance with Docker and you can easily do that with this command.

00:02:24And then we need to initialize the file system by running the juicefs format command.

00:02:29This step tells JuiceFS exactly how to map our database to our storage bucket.

00:02:34We pass it our Redis connection string, our AWS S3 bucket name, and our cloud access credentials.

00:02:41But before running this command, make sure you have created an S3 bucket, an access key, and a secret key.

00:02:48I had already created mine beforehand.

00:02:50So now when I run this, JuiceFS doesn't actually alter anything inside our S3 bucket yet.

00:02:56It simply configures the storage schema inside Redis and assigns a unique UUID to our new virtual file system.
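
The exact commands are shown on screen rather than spoken, so the following is a sketch of what the Redis and format steps plausibly look like, built as argument lists. The container name, database index, volume name, bucket URL, and credential placeholders are all assumptions to replace with your own values.

```python
import shlex


def redis_command(container_name="juicefs-redis", host_port=6379):
    """Argument list for starting a Redis container (names are assumptions)."""
    return ["docker", "run", "-d", "--name", container_name,
            "-p", f"{host_port}:6379", "redis"]


def format_command(meta_url, volume_name, bucket_url, access_key, secret_key):
    """Argument list for `juicefs format` against an S3 bucket."""
    return ["juicefs", "format",
            "--storage", "s3",
            "--bucket", bucket_url,
            "--access-key", access_key,
            "--secret-key", secret_key,
            meta_url, volume_name]


if __name__ == "__main__":
    # Placeholder values only; nothing is executed, the commands are just printed.
    print(shlex.join(redis_command()))
    print(shlex.join(format_command(
        "redis://localhost:6379/1",                      # assumed Redis URL
        "mynas",                                         # hypothetical volume name
        "https://my-bucket.s3.us-east-1.amazonaws.com",  # hypothetical bucket
        "ACCESS_KEY_PLACEHOLDER",
        "SECRET_KEY_PLACEHOLDER",
    )))
```

Printing the commands instead of running them keeps the sketch safe to try; pass the lists to `subprocess.run` once the values are real.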

00:03:03And once the format step is complete, we mount the device to our local machine using the juicefs mount command.

00:03:10We point JuiceFS to our Redis instance and provide a local directory path.

00:03:15In my case, a folder inside my home directory.

00:03:18And before you run this, there's an important caveat.

00:03:21If you're on a Mac, because macOS doesn't support custom file systems out of the box, you need to install a kernel extension utility called macFUSE first.

00:03:30This provides the underlying software hooks JuiceFS needs to communicate with the Mac operating system.

00:03:37And then I'm also passing the free-space-ratio flag, because by default, it sets it to 20% of your drive, which is quite high.

00:03:47And basically it tells JuiceFS that if the free space on the local drive hosting your cache drops below a certain percentage of its total capacity,

00:03:55it needs to stop writing new cache files and start aggressively purging the oldest, least-accessed blocks.

00:04:01This keeps your local operating system from completely running out of disk space.
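
As a sketch of the mount step and the free-space rule just described: the first function builds a plausible `juicefs mount` argument list, and the second expresses the threshold check as plain arithmetic. The ratio value, Redis URL, and mount point are assumptions, and the check is a simplification of whatever eviction logic JuiceFS actually runs.

```python
import shutil


def mount_command(meta_url, mount_point, free_space_ratio):
    """Argument list for `juicefs mount` with an explicit free-space ratio."""
    return ["juicefs", "mount",
            "--free-space-ratio", str(free_space_ratio),
            meta_url, mount_point]


def cache_should_evict(free_bytes, total_bytes, free_space_ratio):
    """True when the free fraction of the cache disk is below the threshold."""
    return free_bytes / total_bytes < free_space_ratio


def local_disk_below_ratio(path, free_space_ratio):
    """Apply the same check to the real disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return cache_should_evict(usage.free, usage.total, free_space_ratio)
```

With a ratio of 0.05, a 1,000 GB disk with 40 GB free would trigger eviction, while the same disk with 60 GB free would not.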

00:04:05All right.

00:04:06So now let's run this command.

00:04:07And the second it executes, the operating system registers a standard POSIX-compliant file system mount.

00:04:15And to the computer, it looks like we just plugged in a massive external hard drive with one petabyte of available space.

00:04:23And now I can easily drag files into this directory like it's an external drive.

00:04:28And if we now go to the S3 bucket dashboard, we see that JuiceFS has stored our files and split them into data chunks.

00:04:37And all this happens behind the scenes and we don't have to do any of the heavy lifting.

00:04:42And to show you how the caching works, we're going to benchmark the file system using a classic terminal command, dd.

00:04:49So if you haven't used dd before, it's a built-in utility used for raw data copying.

00:04:55And in this specific command, if stands for input file, which is pointing to one of our large video files that I added to our JuiceFS drive.

00:05:03And of stands for output file.

00:05:06And we're routing this data straight to /dev/null, which is essentially a black hole in our operating system that discards data instantly, because we're not actually copying the data.

00:05:17We're just doing a benchmark in this example.

00:05:19And we also set the block size to four megabytes to match how JuiceFS slices data chunks.

00:05:25And finally, we prefix the whole line with the time utility so we can see exactly how long the file transfer takes.
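
From the description, the command has roughly the shape `time dd if=<file> of=/dev/null bs=4M` (the exact file path is not given, and the block-size suffix differs between dd implementations). A minimal Python equivalent of the same measurement, with a hypothetical function name, reads the file in 4 MiB blocks, discards the data, and reports the elapsed time:

```python
import time


def timed_read(path, block_size=4 * 1024 * 1024):
    """Read `path` in fixed-size blocks, discard the data, return (bytes, seconds)."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as source:
        while True:
            block = source.read(block_size)
            if not block:  # empty bytes object means end of file
                break
            total_bytes += len(block)
    elapsed = time.perf_counter() - start
    return total_bytes, elapsed
```

Calling `timed_read` twice on the same file inside the mount gives the cold and warm timings to compare.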

00:05:32And the first time we hit enter, this is what is called a cold read because the file was just uploaded.

00:05:38Our local machine doesn't have a copy of it yet.

00:05:41So JuiceFS has to reach out over the Internet to our S3 bucket, fetch all of those four-megabyte data chunks one by one and stream them down.

00:05:50And on my connection, the first run takes, as you can see, considerably long.

00:05:55But watch what happens when we run the exact same command the second time.

00:05:59So boom.

00:06:00There you have it.

00:06:01The terminal prompt returns almost instantly.

00:06:03The second run took less than a single second because now it's in our cache and we're reading it locally.

00:06:10So this is how the multi-tiered caching engine works in action.

00:06:14While JuiceFS was busy downloading those chunks during our first run, it was silently copying them onto our local NVMe scratch disk.

00:06:22But on the second pass, the system bypassed the Internet entirely, pulling the data directly from local hardware at line speeds.

00:06:29So you get the infinite cheap storage capacity of the cloud combined with the zero latency speed of a local drive.

00:06:37So in this demo, I used video files to showcase the functionality, but this applies to almost any real world infrastructure scenario.

00:06:45If you're a DevOps engineer managing container environments, you can just use JuiceFS to provide shared persistent storage across a Kubernetes cluster.

00:06:54And instead of paying for expensive cloud block storage for every single node, all your pods can mount the exact same JuiceFS volume simultaneously.

00:07:03Sharing configuration files, application assets, or user uploads globally while keeping costs incredibly low.

00:07:10And it's also a massive win for machine learning and data science pipelines.

00:07:14Because if you have a massive data set, say hundreds of gigabytes of training images or text data sitting in an S3 bucket, training an ML model usually requires downloading that entire data set locally first, which wastes time and storage space.

00:07:30But with JuiceFS, your training script can start running instantly.

00:07:35The pipeline reads the data sequentially through the mount and JuiceFS handles the high-throughput streaming and local caching in the background, keeping your GPUs fully saturated without local storage bottlenecks.

00:07:49And there's one more cool thing I want to show you.

00:07:51You can also easily hook up metrics to your file system with Better Stack.

00:07:55Every time you mount a JuiceFS volume, it silently spins up a local Prometheus-compatible metrics server in the background.

00:08:03If you open your browser and go to this URL, you can see all the metrics here in plain text, which are tracking every cache hit, read duration, and S3 request errors in real time.
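
The plain-text output at that URL can also be read programmatically. A rough sketch, assuming the metrics endpoint is at `http://localhost:9567/metrics` (the URL is not spoken in the video) and that each sample line is a well-formed `name{labels} value` pair without a trailing timestamp; the function names are hypothetical:

```python
from urllib.request import urlopen

METRICS_URL = "http://localhost:9567/metrics"  # assumed address, adjust to your mount


def fetch_metrics_text(url=METRICS_URL, timeout=5):
    """Download the raw Prometheus text exposition."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def sum_by_metric(text):
    """Sum sample values per metric name, ignoring labels and comment lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, value = line.rsplit(None, 1)
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def hit_ratio(hits, misses):
    """Fraction of reads served from cache; 0.0 when there were no reads."""
    total = hits + misses
    return hits / total if total else 0.0
```

Feeding the hit and miss counters from `sum_by_metric` into `hit_ratio` gives a quick cache-efficiency number without any dashboard.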

00:08:13And to pipe this telemetry data directly into our dashboard, we can use Better Stack's native Prometheus scraping feature.

00:08:21First, we need to go to sources and connect our JuiceFS as a source.

00:08:25And let's choose Prometheus scrape from the metrics tab and click connect source.

00:08:31Now we need to ingest our metrics.

00:08:33But before we do that, we need to open a secure public tunnel to our local metrics port using a tool like ngrok and then paste our ngrok URL.

00:08:43But to make it work with ngrok, we also need to go to the advanced options and add a custom HTTP header named ngrok-skip-browser-warning and set it to true.

00:08:53And this tells ngrok to bypass the warning page entirely, allowing Better Stack to scrape raw metrics securely every few seconds.
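
To check the tunnel before wiring it into a dashboard, the same header can be sent from a small script. This is a sketch with hypothetical function names; the tunnel URL is whatever ngrok prints for your session.

```python
from urllib.request import Request, urlopen


def build_scrape_request(tunnel_url):
    """GET request carrying the header that skips ngrok's browser warning page."""
    return Request(tunnel_url, headers={"ngrok-skip-browser-warning": "true"})


def scrape(tunnel_url, timeout=10):
    """Fetch the metrics text through the tunnel."""
    with urlopen(build_scrape_request(tunnel_url), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

If `scrape` returns HTML instead of plain-text metrics, the header is probably missing or misspelled.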

00:09:01And now our metrics should start ingesting automatically.

00:09:05And let me show you the coolest part.

00:09:07If we now go to Better Stack's AI SRE, we can just prompt the AI SRE to create us a dashboard that monitors cache performance, latency, and the throughput of our system.

00:09:19And in a few seconds, the AI SRE will craft us a beautiful custom dashboard with all the metrics coming in from JuiceFS.

00:09:27And we can also see that it is updating in real time.

00:09:32So how cool is that?

00:09:34So I hope this little demo shows you just how powerful JuiceFS can be when you combine it with modern infrastructure monitoring.

00:09:41We managed to take a cheap standard cloud storage bucket, turn it into an infinitely scalable local drive operating at hardware line speeds, and hook it up to a fully automated observability dashboard in just a few minutes.

00:09:57So there you have it, folks.

00:09:58That is JuiceFS in a nutshell.

00:10:00What do you think of JuiceFS?

00:10:02Have you tried it?

00:10:03Will you use it?

00:10:04Let us know in the comments section down below.

00:10:06And folks, if you like these types of technical breakdowns, please let me know by smashing that like button underneath the video.

00:10:12And also don't forget to subscribe to our channel.

00:10:15This has been Andrus from Better Stack, and I will see you in the next videos.

Key Takeaway

JuiceFS converts cost-effective cloud object storage into a high-performance, POSIX-compliant local drive by using a metadata database and an aggressive, multi-tiered local caching engine.

Highlights

JuiceFS acts as a transparent abstraction layer that separates file data from metadata to enable local-speed access to cloud object storage.
The system utilizes a multi-tiered caching engine, storing frequently accessed data on local NVMe or SSD partitions to achieve hardware line speeds.
Initialization requires a metadata engine like Redis, Postgres, or TiKV alongside an object storage provider such as AWS S3 or Backblaze B2.
The filesystem automatically handles large file splitting into small, optimized chunks, which can be monitored via an integrated, Prometheus-compatible metrics server.
JuiceFS eliminates the need to download entire datasets locally for machine learning pipelines by streaming data through a POSIX-compliant mount.

Timeline

Architecture and Core Functionality

JuiceFS abstracts object storage into a local, POSIX-compliant file system.
Data is stored in cloud object storage while metadata is managed in databases like Redis.
A multi-tiered caching engine stores data chunks locally to minimize network latency.

The system separates raw data from metadata, pushing chunks to cloud providers while organizing directory structures and permissions in a fast database. By using local NVMe or SSD storage as a hot scratch space, the system serves data instantly after the initial request. This approach allows legacy applications and container environments to interact with cloud storage without code modifications.

Deployment and Setup

Initialization involves formatting the filesystem with a Redis connection string and S3 bucket credentials.
macOS users must install macFUSE to provide necessary kernel hooks for custom filesystems.
The free space ratio setting stops new cache writes and purges old blocks once free space on the cache disk falls below a specified fraction of its capacity.

Setting up involves spinning up a Redis instance via Docker and executing a format command to map the database to the object storage bucket. Once formatted, mounting the device registers a standard POSIX-compliant drive on the local machine. The free space ratio parameter ensures the local system remains stable by purging old cache blocks when disk space is low.

Performance Benchmarking

Cold reads require downloading chunks over the network, resulting in higher initial latency.
Warm reads bypass the network entirely by pulling data directly from the local cache.
Using the 'dd' command with matching block sizes demonstrates the speed difference between remote and local cached access.

Benchmarking with the 'dd' utility reveals the efficiency of the caching layer. While the first read fetches data over the internet, subsequent reads are nearly instantaneous because the data is pulled directly from local hardware. This confirms that the system effectively provides cloud capacity with local-drive performance.

Infrastructure and Observability

JuiceFS supports shared persistent storage across Kubernetes clusters, reducing block storage costs.
Machine learning pipelines can stream datasets through the mount without local storage bottlenecks.
Integrated Prometheus metrics enable real-time monitoring of cache hits and throughput.

Beyond basic storage, the system facilitates DevOps and data science workflows by allowing multiple pods or training scripts to access the same volume. Integration with observability tools like Better Stack enables real-time dashboard creation. Using tools like ngrok to tunnel the local metrics server allows external platforms to scrape performance data securely.
