The Biggest Problem Of AI Coding Is Finally Solved
AI LABS
Transcript
00:00:00AI made coding accessible to everyone and people have started shipping code at a much
00:00:04faster pace.
00:00:05But at an even faster pace, security issues inside those apps started piling up.
00:00:09And in the past few months, things have actually gotten worse.
00:00:12There have been many instances when an agent deleted someone's entire project.
00:00:16Another agent deleted an entire production database while the developer was working on
00:00:20something completely unrelated.
00:00:22And there have been many similar issues like Apple's internal CLAUDE.md being leaked.
00:00:26So tooling that can actually catch these issues matters more now than it did before.
00:00:30Seeing this rise in issues, Vercel just released a security harness called DeepSec to detect
00:00:35security issues in AI-generated applications.
00:00:37Now you might think Claude Code can already do security reviews on its own with its agents.
00:00:42So why would you need DeepSec in the first place?
00:00:44It's because DeepSec is a structured tool that handles reviews far more systematically.
00:00:49Under the hood, it's using coding agents like Claude Code and Codex.
00:00:52The tool is designed for scanning large repositories. It batches code into multiple groups
00:00:57and processes them in parallel, which speeds up the workflow and makes it practical to
00:01:01review code bases of that size.
00:01:03Now this is not built with cost-effectiveness in mind.
00:01:06It uses the most powerful models available in Claude Code and Codex, which are Opus 4.7 at
00:01:10max effort and GPT-5.5 at xhigh reasoning, both of which consume a lot of tokens.
00:01:16And with them running in parallel, the token usage piles up quickly, increasing cost.
00:01:20Several well-known apps have already run this harness on their code bases and reported good results.
00:01:25In the tests they ran, the false positive rate of this tool was roughly 10-20%.
00:01:30That is a low number given how error-prone LLM output usually is.
00:01:33Put another way, the agent is correct most of the time and most of what it reports is a
00:01:37true positive.
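To make the 10-20% figure concrete, here is a small arithmetic sketch. It assumes the rate is the share of reported findings that turn out to be false, which the video does not spell out. The function name and the sample size of 50 findings are made up for illustration.

```python
def expected_true_findings(reported: int, false_positive_share: float) -> float:
    """Expected number of real issues among `reported` findings."""
    if not 0.0 <= false_positive_share <= 1.0:
        raise ValueError("false_positive_share must be between 0 and 1")
    return reported * (1.0 - false_positive_share)


# Hypothetical report with 50 findings, at both ends of the quoted range.
low_end = expected_true_findings(50, 0.20)   # 20% of findings are false
high_end = expected_true_findings(50, 0.10)  # 10% of findings are false
print(f"Expect between {low_end:.0f} and {high_end:.0f} real issues out of 50")
```

Under that assumption, a human still has to triage every finding, but most of the triage time goes to real issues.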
00:01:38The architecture behind this is what makes it different.
00:01:40If you ask Claude Code or any agent for a security review, it will start by directly scanning
00:01:45the code base and then produce a full review report.
00:01:48That not only takes a lot of time, but it also consumes a lot of tokens and the review
00:01:52might still miss things.
00:01:53So the first part of this workflow is scanning: a regex-only pass over all files to find the
00:01:58security-sensitive areas that the subsequent steps will focus on.
00:02:01Regex detection matters here because the tool is designed for large code bases where there
00:02:06can easily be thousands of files.
00:02:08The regex matcher is a set of patterns for code that is known to be prone to security
00:02:13vulnerabilities, and it picks the matching files out of the large pool.
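A minimal sketch of what a regex-only prefilter can look like. The three patterns and the sample files are hypothetical and chosen only for illustration; DeepSec's actual pattern set is not shown in the video.

```python
import re

# Hypothetical patterns; a real scanner would carry a much larger, framework-aware set.
SENSITIVE_PATTERNS = {
    "raw-sql": re.compile(r"\bexecute\s*\(\s*f?[\"'].*(SELECT|INSERT|UPDATE|DELETE)", re.IGNORECASE),
    "shell-exec": re.compile(r"\b(os\.system|subprocess\.(run|Popen|call))\s*\("),
    "hardcoded-secret": re.compile(r"(api[_-]?key|secret|password)\s*=\s*[\"'][^\"']+[\"']", re.IGNORECASE),
}


def prefilter(files: dict[str, str]) -> dict[str, list[str]]:
    """Keep only files that match at least one pattern, with the names of the patterns that hit."""
    flagged = {}
    for path, source in files.items():
        hits = [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(source)]
        if hits:
            flagged[path] = hits
    return flagged


sample_files = {
    "app/db.py": 'cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")',
    "app/utils.py": "def add(a, b):\n    return a + b",
    "app/config.py": 'API_KEY = "sk-test-123"',
}
print(prefilter(sample_files))
```

The point of this design is that no model is called at this stage: the cheap pass decides which files are worth the expensive agent's attention.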
00:02:16Once the large pool of files has been filtered, the next step is investigation using the agent.
00:02:21The agent is the expensive part, consuming a lot of tokens and typically taking a long
00:02:25time, depending on how big your code base actually is.
00:02:28So this tool splits all the files into batches and parallelizes them so they can be processed
00:02:32at the same time.
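The batching and parallel processing can be sketched roughly like this. The batch size of 5 follows the figure mentioned later in the video. The worker count, the function names and the placeholder `investigate_batch` are assumptions standing in for the real agent call.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(paths: list[str], batch_size: int = 5) -> list[list[str]]:
    """Split the filtered file list into consecutive batches."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def investigate_batch(batch: list[str]) -> dict:
    """Placeholder for the expensive agent call; it only records what it was given."""
    return {"files": batch, "findings": []}


def process_in_parallel(paths: list[str], batch_size: int = 5, workers: int = 4) -> list[dict]:
    batches = make_batches(paths, batch_size)
    # Threads fit this shape of work because each call mostly waits on a remote model.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input batches.
        return list(pool.map(investigate_batch, batches))


flagged = [f"src/file_{n}.py" for n in range(12)]
results = process_in_parallel(flagged)
print([len(r["files"]) for r in results])
```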
00:02:34Once that process is done, there is a revalidation step in which the findings from the
00:02:37investigation are checked again to weed out false positives.
00:02:40If something was missed, this step catches it, and it confirms that each finding has been
00:02:45classified correctly.
00:02:46This revalidation is actually optional.
00:02:47After that, the agent uses Git metadata and other sources to identify which people are
00:02:51responsible for which issues.
00:02:53Once all of that is done, the findings are stored as Markdown or JSON so that they can
00:02:57be turned into tickets for humans as well as coding agents.
00:03:01Now as mentioned earlier, the files are grouped into batches with around 5 files processed
00:03:05together per batch.
00:03:06For each batch, a fresh prompt is assembled based on the identified framework along with
00:03:11other project information.
00:03:12These are then analyzed by the Claude Agent SDK or the Codex Agent SDK, whichever you have
00:03:17configured, and the agents are given tools with read-only access to understand what the code base contains.
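As a rough illustration of assembling a fresh prompt per batch, here is a sketch. The wording of the prompt, the function name and its parameters are all hypothetical; the video does not show the prompt DeepSec actually builds.

```python
def build_batch_prompt(framework: str, project_info: str, batch: list[str]) -> str:
    """Assemble a new prompt for one batch (hypothetical wording and layout)."""
    file_list = "\n".join(f"- {path}" for path in batch)
    return (
        f"Detected framework: {framework}\n\n"
        f"Project notes:\n{project_info.strip()}\n\n"
        "Review only the files below for security issues. "
        "You have read-only tools for exploring the rest of the code base.\n"
        f"{file_list}\n"
    )


prompt = build_batch_prompt(
    "Express",
    "Auth uses session cookies.\n",
    ["routes/login.js", "routes/admin.js"],
)
print(prompt)
```

Building the prompt from scratch for each batch keeps every agent run focused on its own few files instead of carrying context from earlier batches.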
00:03:22Once they have the findings, everything is merged into a single file that is deduplicated
00:03:26and normalized.
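The merge step can be pictured with the sketch below. The field names and normalization rules are assumptions made for the example, not DeepSec's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    rule: str
    severity: str


def normalize(raw: dict) -> Finding:
    """Bring one raw finding into a canonical shape (hypothetical field names)."""
    return Finding(
        path=raw["path"].replace("\\", "/").removeprefix("./"),
        line=int(raw["line"]),
        rule=raw["rule"].strip().lower(),
        severity=raw.get("severity", "medium").strip().lower(),
    )


def merge_findings(batches: list[list[dict]]) -> list[Finding]:
    """Flatten all batches, normalize each finding, and drop exact duplicates."""
    seen: set[Finding] = set()
    merged: list[Finding] = []
    for batch in batches:
        for raw in batch:
            finding = normalize(raw)
            if finding not in seen:
                seen.add(finding)
                merged.append(finding)
    return merged


batch_a = [{"path": "./app/db.py", "line": "12", "rule": "SQL-Injection ", "severity": "High"}]
batch_b = [
    {"path": "app\\db.py", "line": 12, "rule": "sql-injection", "severity": "high"},
    {"path": "app/auth.py", "line": 40, "rule": "missing-auth-check"},
]
merged = merge_findings([batch_a, batch_b])
for finding in merged:
    print(finding)
```

Normalizing before comparing is what lets two batches that describe the same issue slightly differently collapse into one entry.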
00:03:27At the end, there's a follow-up step to make sure the analysis has actually covered everything.
00:03:31This systematic process and structured analysis are what make the architecture effective,
00:03:36and the agent identifies issues far better with the harness than it could without it.
00:03:41So to test this out, we used an open-source web application that deliberately contains
00:03:45built-in security risks, just for practice.
00:03:47We wanted to see if this tool was able to detect all of the issues in this repo on its
00:03:52own.
00:03:53This project contains 10 security issues, with all the details available directly in the code
00:03:56itself, including how to fix them.
00:03:58So to run DeepSec, you first run the deepsec init command, which installs the dependencies
00:04:03and creates a .deepsec folder, and then you install the dependencies inside that folder.
00:04:08It also gives you a prompt that you need to paste into whichever coding agent you use.
00:04:12Since we were using Claude Code, we ran that prompt in Claude. It contains the instructions
00:04:16for creating a small info.md file that includes all the project information and is built around
00:04:21a specific template.
00:04:23You do not run this prompt in the project folder itself. You run it in the .deepsec
00:04:27folder, because it instructs the agent to look in the parent directory and read all the
00:04:31information from there.
00:04:32The info.md file contains a general overview of what the code base does and what the authentication
00:04:37flow looks like, as well as the threat models, project-specific patterns and all the known
00:04:42false positives inside the code.
00:04:44So once this file has been created, the next task is to run the deepsec scan command.
00:04:48This command is the regex matcher we talked about previously. It finds all the matching
00:04:52endpoints and lists all the filtered files containing potential security issues.
00:04:57This part is fast because it is ordinary code running, with no model involved.
00:05:00The next step is to run the deepsec process command.
00:05:02You can put the API key for whichever provider you want to use, whether it is the Vercel AI
00:05:07Gateway, Codex or Claude, inside the .env.local file.
00:05:11But if you do not do so, like we didn't, it automatically defaults to the Claude Code subscription
00:05:16and uses your authentication instead of requiring any API key.
00:05:19It splits the project into batches and calls multiple tools on each one.
00:05:23After each batch, it gives a summary of how many tokens were used and what the estimated
00:05:27cost was.
00:05:28Now, if you are using a subscription, it will not charge anything beyond your subscription
00:05:32but it still provides an estimate for API costs.
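A per-batch cost estimate is simple arithmetic over token counts. The sketch below uses hypothetical prices and made-up token counts; substitute your provider's current rates, since the video does not state the ones DeepSec uses.

```python
from decimal import Decimal

# Hypothetical prices in USD per million tokens.
INPUT_PRICE_PER_MTOK = Decimal("5.00")
OUTPUT_PRICE_PER_MTOK = Decimal("25.00")
MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimated API cost in USD for one batch, rounded to cents."""
    cost = (
        Decimal(input_tokens) / MILLION * INPUT_PRICE_PER_MTOK
        + Decimal(output_tokens) / MILLION * OUTPUT_PRICE_PER_MTOK
    )
    return cost.quantize(Decimal("0.01"))


# Made-up (input, output) token counts for three batches.
batches = [(120_000, 8_000), (95_000, 6_500), (140_000, 12_000)]
total = sum((estimate_cost(i, o) for i, o in batches), Decimal("0"))
print(f"Estimated total: ${total}")
```

`Decimal` is used instead of floats so that summing many small per-batch amounts does not accumulate rounding noise.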
00:05:35Since this is designed for large codebase reviews, it keeps reliability in mind.
00:05:39So in case there are any errors during the review, it does not restart everything from
00:05:43scratch and instead continues from the point where the error occurred.
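One common way to get this kind of resume behavior is a checkpoint file. The sketch below is an assumption about the general technique, not DeepSec's implementation; the file name `progress.json` and the deliberately flaky processor are invented for the demo.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(batches, process, checkpoint: Path) -> dict:
    """Process batches in order, saving progress so a rerun skips finished work."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for index, batch in enumerate(batches):
        key = str(index)
        if key in done:
            continue  # finished in an earlier run
        done[key] = process(batch)  # may raise; earlier results are already on disk
        checkpoint.write_text(json.dumps(done))
    return done


calls = []
state = {"failed_once": False}


def flaky(batch):
    """Simulated agent call that fails the first time it sees b.py."""
    calls.append(batch)
    if batch == ["b.py"] and not state["failed_once"]:
        state["failed_once"] = True
        raise RuntimeError("simulated agent error")
    return {"files": batch}


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "progress.json"
    work = [["a.py"], ["b.py"], ["c.py"]]
    try:
        run_with_checkpoint(work, flaky, ckpt)
    except RuntimeError:
        pass  # the first run stops at the failing batch
    result = run_with_checkpoint(work, flaky, ckpt)  # the second run resumes at b.py

print(calls)
```

The first batch is processed only once across both runs, which is the property that matters when each batch costs real tokens.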
00:05:46Once the scan has been completed, you run the deepsec report command and it generates a report
00:05:50in both JSON and Markdown format containing a general overview of all the findings categorized
00:05:55by severity level.
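Here is a sketch of producing both a Markdown and a JSON report from the same findings, grouped by severity. The severity names, field names and heading layout are assumptions for illustration, not the format DeepSec emits.

```python
import json
from collections import defaultdict

SEVERITY_ORDER = ["critical", "high", "medium", "low"]


def build_report(findings: list[dict]) -> tuple[str, str]:
    """Return (markdown, json_text) for the same findings, grouped by severity."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding["severity"]].append(finding)

    lines = ["# Security report", ""]
    for severity in SEVERITY_ORDER:
        items = grouped.get(severity, [])
        if not items:
            continue
        lines.append(f"## {severity.title()} ({len(items)})")
        lines.extend(f"- {f['title']} ({f['path']}:{f['line']})" for f in items)
        lines.append("")
    markdown = "\n".join(lines)
    json_text = json.dumps({s: grouped[s] for s in SEVERITY_ORDER if s in grouped}, indent=2)
    return markdown, json_text


findings = [
    {"severity": "high", "title": "SQL injection in login", "path": "app/db.py", "line": 12},
    {"severity": "low", "title": "Verbose error message", "path": "app/errors.py", "line": 7},
    {"severity": "high", "title": "Missing auth check", "path": "app/admin.py", "line": 33},
]
markdown, json_text = build_report(findings)
print(markdown)
```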
00:05:56Now, once this report has been generated, you can run the revalidation step.
00:06:00This step is entirely optional.
00:06:02You can run it if you want or skip it completely.
00:06:04Once you run it, it validates the findings to check whether the reports are false positives
00:06:08or not.
00:06:09After that has been done, you can export everything using the export command and it will write
00:06:13the findings into the findings folder.
00:06:15This findings folder is organized into subfolders named by priority, with one
00:06:20file per identified issue.
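The folder layout described here can be sketched as follows. The priority names, the slug-based file names and the file contents are hypothetical; only the idea of one file per issue inside a priority-named folder comes from the video.

```python
import re
import tempfile
from pathlib import Path


def slugify(title: str) -> str:
    """Turn a finding title into a safe file name stem."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def export_findings(findings: list[dict], root: Path) -> list[Path]:
    """Write one Markdown file per finding under root/<priority>/."""
    written = []
    for finding in findings:
        folder = root / finding["priority"]
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / f"{slugify(finding['title'])}.md"
        target.write_text(
            f"# {finding['title']}\n\nSource: {finding['path']}:{finding['line']}\n"
        )
        written.append(target)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = export_findings(
        [
            {"priority": "high", "title": "SQL injection in login", "path": "app/db.py", "line": 12},
            {"priority": "low", "title": "Verbose error message", "path": "app/errors.py", "line": 7},
        ],
        Path(tmp) / "findings",
    )
    relative = [p.relative_to(tmp).as_posix() for p in paths]

print(relative)
```

One file per issue maps cleanly onto one ticket per issue, whether the ticket goes to a person or to a coding agent.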
00:06:22Each file first lists the source of the issue, meaning the exact file and the lines causing it,
00:06:26how severe the issue is and how confident the model was in identifying it.
00:06:30It also mentions which commit introduced the issue and assigns it to the user who committed it.
00:06:34It then explains the recommended fix, lists the revalidation results and mentions all
00:06:39the issues that were explicitly addressed.
00:06:41It also includes the steps to reproduce the bugs inside the findings.
00:06:44But this report still did not identify all of the issues, even though the tutorial was
00:06:48actually inside the code itself and it should have been able to identify them.
00:06:52So we iterated with Claude on why the original vulnerability lessons that were bundled into
00:06:56the app by design were not identified.
00:06:59Upon iteration with Claude, we found that the reason this tool only reported 3 findings was
00:07:03an explicit statement in the info.md file.
00:07:07It told DeepSec to expect an app where the 10 vulnerabilities are already known, so DeepSec
00:07:12deliberately skipped them and looked only for other patterns.
00:07:16That way the scan goes beyond what is already documented and does not waste time and
00:07:21tokens on known issues.
00:07:25We then tested another app to see if it did better this time.
00:07:28We ran the same steps, starting from the scan to the processing stage.
00:07:32We did not run the revalidation part, we just created the report and exported it directly.
00:07:36And this time Claude's info.md file only contained details about the app and did not include statements
00:07:42like the previous one.
00:07:43Side by side, we also asked Claude to review the code and write a report.md file with a
00:07:48complete security review so we could compare which one actually performed better.
00:07:52So the report created by DeepSec found multiple bugs with different severity levels.
00:07:56It found 9 issues and created a detailed report along with recommended steps on how to fix
00:08:01them.
00:08:02And these recommended steps are what most other reports miss because this is what helps
00:08:05the agent understand how to fix the issue, which makes debugging much easier.
00:08:09But we noticed that Claude's report was much more detailed and highlighted 39 issues.
00:08:13So we asked it to create a diff first.
00:08:15The diff showed that Claude's number was larger.
00:08:18But we had already seen this during our testing with Codex.
00:08:20Claude tends to flag issues outside the requested scope along the way.
00:08:24It does not solely focus on the scoped issues that DeepSec was specifically designed for.
00:08:29So once we asked it to focus only on scope, it narrowed the findings down to 13 issues.
00:08:34But there were still a few issues that DeepSec missed which were identified in Claude's report.
00:08:38The reason DeepSec missed a few findings is that it focuses only on issues that the
00:08:43code directly contains and that can be resolved directly from the functions themselves.
00:08:47It does not identify issues that might arise when the app actually runs, like CORS-related
00:08:52problems.
00:08:53It does not really focus on logical patterns and architectural decisions either.
00:08:57As we mentioned previously, it uses regex to filter out files first.
00:09:01So it mainly focuses on what is explicitly present in the code and not on issues that
00:09:05may occur dynamically when the application is running.
00:09:08Also, if you are enjoying our content, consider pressing the hype button because it helps us
00:09:12create more content like this and reach more people.
00:09:15Now instead of running these steps one by one on our own, we've created this DeepSec skill
00:09:20which contains all the instructions on how to use Vercel's security scanner end to end
00:09:24and how it should identify from the user's prompt what is being asked.
00:09:28It then follows the entire step by step process and manages the whole harness on its own.
00:09:32It is also bundled with multiple assets, evals and references for all the issues, along with
00:09:37multiple scripts that support the workflow and the overall functioning
00:09:42of this repository.
00:09:43So with this in place, you can just run this security scan and specify which model you want
00:09:47to use and it will directly handle everything for you.
00:09:50It will run through all the steps we saw earlier along with addressing the issues that it missed
00:09:54previously and will be able to perform a much better security review by combining DeepSec's
00:09:59abilities while also covering the gaps in its findings.
00:10:02Now this skill, along with all resources for this video and for all our previous videos,
00:10:07can be found in AI Labs Pro, where you can download it and use it for your own projects.
00:10:11If you've found value in what we do and want to support the channel, this is the best way
00:10:15to do it.
00:10:16The link is in the description.
00:10:17That brings us to the end of this video.
00:10:19If you'd like to support the channel and help us keep making videos like this, you can do
00:10:23so by using the super thanks button below.
00:10:25As always, thank you for watching and I'll see you in the next one.