I know that chilling feeling you get every time you look at an Amazon S3 bill. For a data engineer, S3 is as essential as air, but calling APIs tens of thousands of times for testing and moving massive files can lead to a situation where the overhead costs more than the actual value. As of 2025, S3 Standard storage costs $0.023 per GB, but the real nightmare is the Data Transfer (Egress) fee. Once you cross 100GB, you're looking at $0.09 per GB—nearly four times the storage cost. To save on these costs, many try spinning up MinIO locally, only to struggle when it diverges from production code. Here is the setup guide I use in practice to solve this.
Hardcoding S3 addresses directly into your application code is a dangerous habit: a local address accidentally left in at deployment can take the system down. The safer pattern is to read the endpoint from an environment variable at startup and hand it to Boto3 when the client is created. Locally the variable points at MinIO; in production it is simply unset, so Boto3 falls back to the standard AWS endpoint.
Setup Method
- Set `AWS_S3_ENDPOINT_URL=http://localhost:9000` in your local `.env` file.
- Read it with `os.getenv("AWS_S3_ENDPOINT_URL")`.
- Pass it to `boto3.client("s3", endpoint_url=endpoint)`.

With this configuration, the production server won't have that environment variable, so Boto3 defaults to the standard AWS address. This is the most reliable way to bring the cost of tens of thousands of PUT/GET requests during local testing down to $0.
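The steps above can be sketched as follows. Note that `AWS_S3_ENDPOINT_URL` is this article's own variable name, not one Boto3 reads by itself, so the code fetches it explicitly; the helper names are mine:

```python
import os


def resolve_s3_endpoint():
    """Return the MinIO endpoint locally, or None so Boto3 targets real AWS."""
    return os.getenv("AWS_S3_ENDPOINT_URL")


def make_s3_client():
    # boto3 is imported lazily so the module stays importable without it installed.
    import boto3

    # endpoint_url=None makes Boto3 fall back to the standard AWS endpoint.
    return boto3.client("s3", endpoint_url=resolve_s3_endpoint())
```

In production, where the variable is unset, `resolve_s3_endpoint()` returns `None` and the client behaves exactly like a plain `boto3.client("s3")`.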
You'll run into errors if you point Terraform code written for production infrastructure at local MinIO as-is, because the Terraform AWS provider tries to validate the real AWS Account ID by default. In a local environment, you need to skip this validation.
Terraform Configuration Example
- Set `s3 = "http://localhost:9000"` within the `endpoints` configuration of the provider block.
- Set the `s3_use_path_style`, `skip_credentials_validation`, and `skip_requesting_account_id` attributes all to `true`.
- Use a dummy value such as `mock_key` for `access_key` and `secret_key`.

With these settings, Terraform will create bucket policies and Lifecycle Rules on local MinIO without needing a real AWS account connection. This effectively lowers deployment failure rates by letting you catch errors in your infrastructure definitions beforehand.
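Combined, the settings above yield a provider block roughly like this; a sketch assuming MinIO on `localhost:9000` (`skip_metadata_api_check`, which stops the provider probing the EC2 metadata service, is usually needed locally as well):

```hcl
provider "aws" {
  region     = "us-east-1"
  access_key = "mock_key"
  secret_key = "mock_key"

  # Skip every check that would need a real AWS account.
  skip_credentials_validation = true
  skip_requesting_account_id  = true
  skip_metadata_api_check     = true

  # MinIO serves buckets as paths, not virtual-hosted subdomains.
  s3_use_path_style = true

  endpoints {
    s3 = "http://localhost:9000"
  }
}
```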
To properly evaluate query performance, your mock data needs scale. However, generating data with standard loops is agonizingly slow. I use Polars or Apache Arrow. Polars uses vectorized operations, making it up to 10 times faster than Pandas.
Data Generation Process
- Generate fake records with the `Faker` library and build 100,000-row chunks using Polars.
- Use the `write_to_dataset` feature of the pyarrow engine to save Parquet files partitioned in Hive style (e.g., `year=2026/month=04`).

Repeatedly uploading and downloading 100GB of data in the cloud can rack up hundreds of dollars in charges. It is much kinder to your wallet to test by pushing your local hardware to its limits.
When testing serverless logic that triggers automatically upon file upload, you can use MinIO's bucket notification feature. MinIO supports a Webhook function that sends JSON data to a specified HTTP endpoint whenever an object is created.
Implementation Steps
- Set `MINIO_NOTIFY_WEBHOOK_ENDPOINT` in the MinIO settings to your local server address.
- Confirm that the `s3:ObjectCreated:Put` event is correctly received by the local server.

Reliability during event bursts is determined by the queue size and the event occurrence rate. For a local testing environment, it's better for your peace of mind to set a generous `queue_limit`.
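A minimal receiver for that check, using only the standard library; it assumes MinIO's webhook body follows the S3-style `Records` layout, and the port and handler names are placeholders:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_event(payload: bytes):
    """Extract (event_name, bucket, key) from a MinIO webhook body."""
    record = json.loads(payload)["Records"][0]
    return (
        record["eventName"],
        record["s3"]["bucket"]["name"],
        record["s3"]["object"]["key"],
    )


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event, bucket, key = parse_event(self.rfile.read(length))
        print(f"{event}: s3://{bucket}/{key}")  # stand-in for your trigger logic
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocks forever; run while MinIO fires events at this endpoint."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Upload a file to the watched bucket and the `s3:ObjectCreated:Put` line should appear in the receiver's output.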
Sometimes files created inside a Docker container won't open on the host machine due to permission issues. macOS users in particular should check that 'VirtioFS' is enabled in the Docker Desktop settings: VirtioFS is up to 98% faster than the legacy gRPC FUSE method for file system processing, a difference you can feel when handling large datasets.
Permission Fix
- Add the `--user $(id -u):$(id -g)` option to `docker run` so host and container permissions match.
- Mount your host working directory to the `/data` path of the container.

Setting up a solid local environment gives you a perfect laboratory for dissecting infrastructure mechanics without worrying about costs. Beyond just saving money, it lets you keep an independent development rhythm that isn't dictated by cloud environments.
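For reference, the two permission fixes above collapse into a single `docker run`; a sketch where the `./data` mount, the image, and the command are placeholders for your own setup:

```shell
# Run as the host user's UID:GID so files written to /data stay readable
# on the host; ./data on the host is mounted into the container at /data.
docker run --rm \
  --user "$(id -u):$(id -g)" \
  -v "$(pwd)/data:/data" \
  python:3.12-slim \
  python -c "open('/data/ok.txt', 'w').write('done')"
```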