Dolt: This Makes SQL Feel Like Git

BBetter Stack
Computing/SoftwareSmall Business/StartupsInternet Technology

Transcript

00:00:00your code has git your data has what exactly that's the problem one bad csv one config row
00:00:07one spreadsheet edit and now your app is broken there's no clean diff no branch pull request no
00:00:13obvious rollback this is dolt it's a sql database that works just like git you can branch tables
00:00:20edit rows diff changes commit them and merge them back actual version control for actual data i'll
00:00:26show you how to spin this up and how to get it going in the next couple minutes
00:00:35now we know most of the time databases are great at what being databases they store data they let
00:00:41you query with sql but they are not great at the workflows we use every day which is branching
00:00:47reviewing diffing merging rolling back seeing exactly who changed what so oftentimes we pick
00:00:54one of two bad options option one is you keep the data in a real database you get sql indexes
00:01:00constraints and structure but when the data changes the review process is usually not quite there
00:01:07then option two is we put the data in a csv json or yaml so git can actually track it now you get
00:01:13commits and pull requests but you lose things databases are actually good at no real sql weak
00:01:20schema enforcement painful diffs and merges and when someone asks who changed this record well the answer
00:01:27is basically going to be the same someone with database access that's not any kind of real
00:01:32workflow but now imagine this instead what if you could run a branch command dolt branch fix this data
00:01:39dolt diff dolt commit dolt merge now these are commands we're already using but we're using them
00:01:46against your actual database tables that's what dolt is doing it's version control for our databases
00:01:52if you enjoy coding tools to speed up your workflow be sure to subscribe we have videos coming out
00:01:57all the time enough talking we have options there's dolt for sql lite postgres you name it let's do the
00:02:04fast version of this i'm gonna cd in here and i'm gonna clone the dolt hub getting started from github
00:02:10i'll cd into the folder now first clone a public dolt database and i'm gonna run dolt sql now we are
00:02:18inside sql so i can run sql commands right here in the terminal okay cool i'm gonna make a small change
00:02:27and we're gonna run dolt diff and this is the first wait what just happened here kind of moment
00:02:34dolt does not say some file changed it shows the actual table diff which row changed which column
00:02:43changed the old value and the new value i can see right here now we can commit that so dolt add
00:02:50then i can run don't commit m i'll drop in a comment i can create an entire branch of this by using
00:02:56checkout and we'll just run checkout b name your branch if i make another change on top of this i can
00:03:03diff it again with gold diff i can commit it again and then i can add it again now if i jump back and
00:03:10merge i can check out main and i can run dolt merge all these commands we already know we're just doing
00:03:17this now with sql at the end you could run dolt log now your database has a commit history not a backup
00:03:24not a dump file and not some spreadsheet edit log a real version database that's the whole core idea here
00:03:31git workflows but for tables now let's pull back everything and see how all this actually works
00:03:37at first dolt feels familiar on purpose you have commands like dolt status diff add commit branch
00:03:44checkout if you know git your brain already understands the shape of this entire workflow
00:03:48they're going for dolt is not tracking files it's tracking relational tables you can use it from the
00:03:55command line or you can run dolt sql server by doing that you can now connect it using mysql compatible
00:04:01clients orms bi tools or application code so your app can treat dolt like a normal sql database but you
00:04:09get version control around the data that's the important part you're not choosing between a real
00:04:14database and a git workflow you get both in the same place dolt uses something called crawly tree the easy
00:04:22version of crawly tree here is a normal database uses tree-like structures to make reads and writes
00:04:29fast dolt uses a tree-like structure that is also good at versioning so instead of copying the whole
00:04:36database every time you commit dolt can share the parts that stayed the same and track the parts that
00:04:42actually changed now we're not just asking things like hey what's the current value we can actually ask
00:04:47things like what did this row look like before something happened that's the big thing here
00:04:52because when something breaks we don't want to have to guess you want to inspect the history i can now
00:04:56review the diff you can see the change and if needed you can roll back this is version control
00:05:02for structured data your branches commits diffs merges histories for rows and columns so where does dolt
00:05:10actually fit into the flow of things because this all sounds great but this is where it might get
00:05:15confusing you might hear git for data and you're probably thinking okay well look we already have
00:05:21tools for that yes we kind of do have tools for those but they solve different problems you can put csvs
00:05:28and json files in git that works when the data is tiny and simple git doesn't understand your schema
00:05:35it doesn't know your primary key and it's not going to enforce constraints it can't run joins on your csv
00:05:41unless you add more tools so git gives you version control but it's not really for a database
00:05:47then there's dvc dvc is great for ml workflows especially large data sets and model artifacts but
00:05:53it's not trying to be your live relational database yes you have lake fs that brings git like ideas to
00:06:00object storage and data lakes very useful at lake scale but again that's an entirely different layer it's
00:06:07it's not the same thing as saying branch the sql table change some rows run a diff and merge it back
00:06:13traditional databases also have history tools temporal tables audit logs cdc but most of them
00:06:20don't feel like a normal workflow they don't give you the clean loop branch change diff merge rollback
00:06:27i wouldn't blindly drop dolt into every production system that's not the point here the point is this
00:06:33if your work involves structured data that changes over time and those changes can actually break things
00:06:40i think dolt is worth trying the first time it saves you from one silent bad data change the workflow
00:06:46starts to feel a bit more obvious we have git why don't we have something for data now we kind of do if
00:06:52you enjoy coding tools like this be sure to subscribe to the better stack channel we'll see you in another
00:06:57video

Key Takeaway

Dolt brings Git-like version control to SQL databases, allowing developers to branch, diff, and merge relational data without sacrificing schema enforcement or performance.

Highlights

  • Dolt provides version control for SQL databases, enabling branch, diff, commit, and merge operations on tables.

  • Unlike Git, which tracks files, Dolt tracks relational tables to maintain schema enforcement and primary key constraints.

  • Dolt uses a Prolly tree structure to share unchanged data across commits, preventing the need to copy the entire database for every change.

  • Database interaction occurs via standard terminal commands or through a SQL server compatible with MySQL clients, ORMs, and BI tools.

  • Users can query historical data to identify specific row or column values before and after a change, replacing ambiguous audit logs.

Timeline

The Problem with Data Versioning

  • Standard databases lack native version control features like branching, diffing, and merging.
  • Storing data in CSV or JSON files inside Git allows for version control but sacrifices SQL indexes, constraints, and efficient querying.

Database management often forces a choice between functional SQL environments and version-controlled file systems. Conventional database workflows struggle with error recovery because they lack clean rollback paths and granular change tracking. Using files like CSV for version control breaks database utility by removing schema enforcement and making data modification history difficult to audit.

Dolt Functionality and Workflow

  • Dolt enables SQL-specific commands such as `dolt branch`, `dolt diff`, `dolt commit`, and `dolt merge`.
  • Executing `dolt diff` displays granular changes at the row and column level rather than just file-level modifications.
  • Database histories are maintained as a series of commits, providing a verifiable log of data state changes.

The command-line interface mirrors Git, making the workflow immediately familiar to developers. After cloning a database, users can modify rows, track changes through diffs, and create branches for testing or data fixes. This system replaces traditional backup dumps or spreadsheet edit logs with a concrete commit history for relational tables.

Architecture and Use Cases

  • Dolt functions as a standard MySQL-compatible SQL server when accessed via applications or BI tools.
  • The underlying Prolly tree structure optimizes storage by sharing unchanged data across versions instead of duplicating the database.
  • Dolt targets structured data environments where accidental changes have a high risk of breaking production workflows.

While other tools like DVC or LakeFS address versioning for machine learning models or object storage, Dolt specifically handles relational databases. It provides the necessary structure to enforce schema constraints while maintaining a branch-and-merge workflow. This allows teams to inspect historical row states to diagnose and revert data corruption without relying on guesswork.

Community Posts

No posts yet. Be the first to write about this video!

Write about this video