Transcript
00:00:00your code has git your data has what exactly that's the problem one bad csv one config row
00:00:07one spreadsheet edit and now your app is broken there's no clean diff no branch pull request no
00:00:13obvious rollback this is dolt it's a sql database that works just like git you can branch tables
00:00:20edit rows diff changes commit them and merge them back actual version control for actual data i'll
00:00:26show you how to spin this up and how to get it going in the next couple minutes
00:00:35now we know most of the time databases are great at what being databases they store data they let
00:00:41you query with sql but they are not great at the workflows we use every day which is branching
00:00:47reviewing diffing merging rolling back seeing exactly who changed what so oftentimes we pick
00:00:54one of two bad options option one is you keep the data in a real database you get sql indexes
00:01:00constraints and structure but when the data changes the review process is usually not quite there
00:01:07then option two is we put the data in a csv json or yaml so git can actually track it now you get
00:01:13commits and pull requests but you lose things databases are actually good at no real sql weak
00:01:20schema enforcement painful diffs and merges and when someone asks who changed this record well the answer
00:01:27is basically going to be the same someone with database access that's not any kind of real
00:01:32workflow but now imagine this instead what if you could run a branch command dolt branch fix this data
00:01:39dolt diff dolt commit dolt merge now these are commands we're already using but we're using them
00:01:46against your actual database tables that's what dolt is doing it's version control for our databases
00:01:52if you enjoy coding tools to speed up your workflow be sure to subscribe we have videos coming out
00:01:57all the time enough talking we have options there's dolt for sql lite postgres you name it let's do the
00:02:04fast version of this i'm gonna cd in here and i'm gonna clone the dolt hub getting started from github
00:02:10i'll cd into the folder now first clone a public dolt database and i'm gonna run dolt sql now we are
00:02:18inside sql so i can run sql commands right here in the terminal okay cool i'm gonna make a small change
00:02:27and we're gonna run dolt diff and this is the first wait what just happened here kind of moment
00:02:34dolt does not say some file changed it shows the actual table diff which row changed which column
00:02:43changed the old value and the new value i can see right here now we can commit that so dolt add
00:02:50then i can run don't commit m i'll drop in a comment i can create an entire branch of this by using
00:02:56checkout and we'll just run checkout b name your branch if i make another change on top of this i can
00:03:03diff it again with gold diff i can commit it again and then i can add it again now if i jump back and
00:03:10merge i can check out main and i can run dolt merge all these commands we already know we're just doing
00:03:17this now with sql at the end you could run dolt log now your database has a commit history not a backup
00:03:24not a dump file and not some spreadsheet edit log a real version database that's the whole core idea here
00:03:31git workflows but for tables now let's pull back everything and see how all this actually works
00:03:37at first dolt feels familiar on purpose you have commands like dolt status diff add commit branch
00:03:44checkout if you know git your brain already understands the shape of this entire workflow
00:03:48they're going for dolt is not tracking files it's tracking relational tables you can use it from the
00:03:55command line or you can run dolt sql server by doing that you can now connect it using mysql compatible
00:04:01clients orms bi tools or application code so your app can treat dolt like a normal sql database but you
00:04:09get version control around the data that's the important part you're not choosing between a real
00:04:14database and a git workflow you get both in the same place dolt uses something called crawly tree the easy
00:04:22version of crawly tree here is a normal database uses tree-like structures to make reads and writes
00:04:29fast dolt uses a tree-like structure that is also good at versioning so instead of copying the whole
00:04:36database every time you commit dolt can share the parts that stayed the same and track the parts that
00:04:42actually changed now we're not just asking things like hey what's the current value we can actually ask
00:04:47things like what did this row look like before something happened that's the big thing here
00:04:52because when something breaks we don't want to have to guess you want to inspect the history i can now
00:04:56review the diff you can see the change and if needed you can roll back this is version control
00:05:02for structured data your branches commits diffs merges histories for rows and columns so where does dolt
00:05:10actually fit into the flow of things because this all sounds great but this is where it might get
00:05:15confusing you might hear git for data and you're probably thinking okay well look we already have
00:05:21tools for that yes we kind of do have tools for those but they solve different problems you can put csvs
00:05:28and json files in git that works when the data is tiny and simple git doesn't understand your schema
00:05:35it doesn't know your primary key and it's not going to enforce constraints it can't run joins on your csv
00:05:41unless you add more tools so git gives you version control but it's not really for a database
00:05:47then there's dvc dvc is great for ml workflows especially large data sets and model artifacts but
00:05:53it's not trying to be your live relational database yes you have lake fs that brings git like ideas to
00:06:00object storage and data lakes very useful at lake scale but again that's an entirely different layer it's
00:06:07it's not the same thing as saying branch the sql table change some rows run a diff and merge it back
00:06:13traditional databases also have history tools temporal tables audit logs cdc but most of them
00:06:20don't feel like a normal workflow they don't give you the clean loop branch change diff merge rollback
00:06:27i wouldn't blindly drop dolt into every production system that's not the point here the point is this
00:06:33if your work involves structured data that changes over time and those changes can actually break things
00:06:40i think dolt is worth trying the first time it saves you from one silent bad data change the workflow
00:06:46starts to feel a bit more obvious we have git why don't we have something for data now we kind of do if
00:06:52you enjoy coding tools like this be sure to subscribe to the better stack channel we'll see you in another
00:06:57video
Community Posts
No posts yet. Be the first to write about this video!
Write about this video