Agent Smith - will you take over?

The beginnings of artificial intelligence

The journey of AI began in the 1960s, when scientists attempted to replicate the functioning of neurons and brain circuits in technical models. Due to the limited computing power and memory capacity of the time, these early attempts failed, which led to a so-called “AI winter”. During this time, research into AI was largely discontinued, and those who persisted were often ridiculed.

The breakthrough with ChatGPT

The introduction of ChatGPT sparked a new wave of hype around AI. Although natural language processing (NLP) existed long before, ChatGPT added a new dimension to the world of AI. Suddenly a model was available that could solve tasks previously reserved for humans. The ability of GPT-3.5 and later GPT-4 to pass school and university exams marked a significant breakthrough and generated a lot of attention, which quickly grew into an AI hype.

Superpowers for everyone? Or the end of work?

The possibilities created by AI are manifold. From text creation in marketing and code generation to image, music and video production - the use cases are numerous and the speed of development is breathtaking. One notable example is the controversy over an AI-generated image that won an art competition. This discussion raises questions about copyright and the originality of such creations.

Recently, studies have repeatedly discussed the potential of AI models to take over areas of human work. This raises the question of whether AI models like ChatGPT complement or replace human work. In contrast to previous automation, which mainly replaced repetitive and simple tasks, the new AI models seem to be able to take on complex and creative tasks as well.

How will this impact the work of agile teams? Will we be supplemented by tools or will Agent Smith take over?

The impact on software development

The possibilities of using AI in software development are increasingly coming into focus. For some time now, developers have been able to use tools such as GitHub Copilot and Tabnine to write code faster and more efficiently, get familiar with unknown code more quickly, learn new frameworks or even find errors.

Anyone who has used these tools themselves knows that they can be an enormous help, but do not always lead to the desired results. So far, this has not been a threat to developers - the copilot is not able to design, develop, test and document complex software products from start to finish. But this could change.

The limits of current tools

The background to this rather limited performance is that the tools work in single-shot mode: they are given a task and have to complete it in one go. They do receive context, i.e. they can view the existing code in the project, but they cannot break the task down into several sub-steps and complete and improve those steps incrementally based on feedback.

To visualize the effect of this restriction, consider the following case: you are given the task of writing a program with a certain functionality. However, you only have one attempt and have to write it from top to bottom in one go. You cannot make corrections using the cursor keys or the delete key, and you cannot run the program to test it.

This approach may still work for simple tasks, but it quickly becomes impractical for more complex ones. After all, this is not how humans work: we read the task, think of a solution, break the task down into steps, write part of the code, test it, correct it, continue writing, refactor, and thus iteratively approach a complete, working solution.

What if we could transfer this approach to the tools using language models?

This is where the idea of agent-based workflows comes into play.

The role of agents

Agent-based workflows are a new approach in AI development. They make it possible to solve complex problems through the collaboration of specialized agents. For example, one agent can act as a planner, while other agents take on specific tasks such as writing code or creating documentation. Agents can also interact with each other in feedback loops, critically review each other’s work, suggest improvements and thus support each other in building better solutions.

Agent - more than a large language model

Here we see the concept of an agent as a diagram. A human interacts with the agent and communicates the task. The agent is assigned a specific role via its system prompt, e.g. “You are a senior Angular front-end developer”. It also has access to additional context, helpful information, additional tools and a language model in order to complete the assigned task.
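To make this more concrete, here is a minimal sketch in Python of what such an agent could look like. The Agent class, the call_llm placeholder and the role prompt are purely illustrative assumptions and are not taken from any specific framework.

from dataclasses import dataclass, field

def call_llm(model: str, messages: list, tools: list) -> str:
    # Placeholder for the chat-completion call of your LLM provider.
    raise NotImplementedError

@dataclass
class Agent:
    system_prompt: str                           # fixes the role, e.g. "You are a senior Angular front-end developer."
    model: str = "gpt-4"                         # language model used to complete the task
    context: list = field(default_factory=list)  # additional context, e.g. existing code or documentation
    tools: list = field(default_factory=list)    # tools the agent may use, e.g. file access or a test runner

    def run(self, task: str) -> str:
        # Combine role, context and task into one message list and hand it to the language model.
        messages = [{"role": "system", "content": self.system_prompt}]
        messages += [{"role": "user", "content": c} for c in self.context]
        messages.append({"role": "user", "content": task})
        return call_llm(self.model, messages, self.tools)

frontend_dev = Agent(system_prompt="You are a senior Angular front-end developer.")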

An example of a software-engineering agent with a corresponding environment and interfaces is Princeton University’s SWE-agent project, which provides an agent that autonomously fixes bugs and resolves issues in GitHub repositories. To put this in context: according to the SWE-bench benchmark, the SWE-agent achieves a complete solution in 12.99% of the tasks.

However, this alone would not be real progress compared to the previous approach. The real potential of agent-based workflows unfolds when we employ a team of agents in different roles working together on a complex topic.

Agentic workflows for improved performance

Here we can see an example of an agent-based workflow. A planning agent coordinates the collaboration of various specialized agents, each of which is responsible for specific tasks. This collaboration makes it possible to break complex problems down into individual, manageable steps and solve them using specially configured agents.

The architecture of the project is designed by an agent that is tailored to the specific best practices and specifications of the project (e.g. via prompt engineering or Retrieval Augmented Generation). The architectural design is critically reviewed by a second agent that specializes in compliance with quality standards, best practices, and architectural and security guidelines.

The Angular agent shown in the graphic above takes over the implementation of the front-end stories. It is supported by a review agent that checks the completeness and quality of the code and makes suggestions for improvement. Tests are created and executed by a specialized test agent. Any errors found are reported back to the implementation agent and rectified there. Another agent is responsible for creating the documentation.

Additional performance can be gained by equipping the agents with language models specially tuned to their respective tasks. For example, an agent responsible for generating code can use a language model that is specifically tailored to the syntax and semantics of programming languages (e.g. through model fine-tuning).

In theory, this concept can be used to build a complete team of specialized agents that work together in different roles and support each other via feedback loops to jointly build a complex project. This approach promises substantially higher performance than the simple single-shot approach.
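As a rough illustration of such a team - reusing the illustrative Agent sketch from above, with role prompts, the line-based task format and the APPROVED convention invented for this example - a planning agent could be wired to the specialized agents roughly like this:

planner   = Agent(system_prompt="You are a planner. Break the requirement into small, ordered development tasks, one per line.")
architect = Agent(system_prompt="You are a software architect. Design the solution for the given task.")
developer = Agent(system_prompt="You are a senior Angular front-end developer. Implement the given task.")
reviewer  = Agent(system_prompt="You are a code reviewer. Answer APPROVED or list concrete improvements.")

def run_workflow(requirement: str) -> list:
    results = []
    # The planner decomposes the complex problem into manageable steps ...
    for task in planner.run(requirement).splitlines():
        design = architect.run(task)
        code = developer.run(f"{task}\n\nArchitecture notes:\n{design}")
        # ... and a feedback loop between developer and reviewer improves each increment.
        for _ in range(3):  # limit the number of review rounds
            feedback = reviewer.run(code)
            if feedback.strip().startswith("APPROVED"):
                break
            code = developer.run(f"Revise the code based on this review:\n{feedback}\n\nCode:\n{code}")
        results.append((task, code))
    return results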

What do agentic workflows look like in practice?

Practical applications

Agent-based workflows are already being used in various areas today. Companies such as Microsoft and Meta are experimenting with these technologies - at Meta, for example, to improve unit test coverage.

Startups are demonstrating their vision of agent-based solutions for complex software engineering tasks and are currently being funded generously - one example is Devin from Cognition Labs.

GitHub is launching a currently still private “Technical Preview” of GitHub Copilot Workspaces, which is intended to enable collaboration between developers and AI models in an integrated development environment.

However, there are also open-source projects, such as GPT-Pilot, ChatDev and Devika, which translate our theoretical concept of a team of agents into practice and illustrate the potential of this new way of working.

Experiment: GPT-Pilot

We took a closer look at GPT-Pilot and conducted an experiment to see what working with a team of agents feels like in practice.

GPT-Pilot is a command-line tool written in Python that can create entire apps. For this purpose, it defines a team of ten agents with different tasks:

  • Product Owner agent: Responsible for the project as a whole and its breakdown into tasks.
  • Specification Writer agent: Asks questions to better understand the requirements if the project description is not sufficient.
  • Architect agent: Writes down the technologies that will be used for the app and checks whether they are all installed on the computer. If not, it installs them.
  • Tech Lead agent: Writes the development tasks that must be implemented by the developer.
  • Developer agent: Takes each task and describes, in human-readable form, what needs to be done to implement it.
  • Code Monkey agent: Takes the developer’s description and the existing file and implements the changes.
  • Reviewer agent: Checks every step of the task and, if something is done wrong, sends the task back to the Code Monkey.
  • Troubleshooter agent: Helps the user give good feedback to GPT-Pilot if something is wrong.
  • Debugger agent: In case of an error, tries to find the cause from the information provided and gives advice on how to rectify it.
  • Technical Writer agent: Writes the documentation for the project.

In addition to the agent roles, GPT-Pilot defines a workflow that coordinates agent collaboration, enables feedback loops and monitors progress.

Specification of the requirements

The creation of a sufficient description of the desired result is - as in “real life” - not trivial: in addition to the functional requirements, specifications for the technical basis (e.g. framework, database), the architecture and structure of the project should also be considered, as well as non-functional requirements such as performance, scalability and security.

If you now feed GPT-Pilot with your specification, it may ask questions or need clarification. Once the task has been sufficiently described from the agent’s point of view, the agent begins to structure the project and breaks it down into tasks and work packages.

We defined the requirements for a simple time-tracking tool named “TimR”, inspired by the example in the GPT-Pilot Wiki.
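To give an impression of the level of detail, a shortened, hypothetical excerpt of such a specification could look like this (an illustration, not our exact input):

Build a simple time-tracking tool named “TimR”.
- Users can register, log in and log out.
- Logged-in users can create time entries and view them in a list.
- A reports page shows the tracked time in simple charts.
- Technical basis: Node.js backend with a MongoDB database.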

Implementation

Once the steps and tasks have been defined, the implementation agents take over. After the successful implementation of an increment and a positive review by another agent, test cases are created, which the human client should then carry out and, if necessary, give feedback on in the form of error reports (actual/desired behaviour, error messages).

In other words: the agents ask the human to start the current application and execute the test cases manually. The feedback is then returned to GPT-Pilot, where it undergoes a multi-stage analysis by several agents. These agents propose changes, correct errors, review the status, and then resubmit the revised application for testing along with updated test cases.

In this manner, each larger task is broken down into smaller, manageable steps and is iteratively processed by specialized agents. Humans are involved in the process at certain points in order to check the quality of the results and provide feedback.
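Sketched with the illustrative Agent class from earlier (the function and prompts are our own simplification, not GPT-Pilot’s actual code), this human-in-the-loop cycle looks roughly like this:

def human_in_the_loop(increment: str, debugger: Agent, developer: Agent) -> str:
    # The human client runs the application and the manual test cases ...
    report = input("Error report (leave empty if everything works): ")
    while report.strip():
        # ... the report is analyzed by a debugging agent, which proposes a fix ...
        advice = debugger.run(f"Error report:\n{report}\n\nCurrent code:\n{increment}")
        # ... and the implementation agent applies the correction before the next test round.
        increment = developer.run(f"Fix the code according to this advice:\n{advice}\n\nCode:\n{increment}")
        report = input("Error report after the fix (leave empty if everything works): ")
    return increment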

In our experiment, completing the project implementation required approximately one hour, utilizing the OpenAI GPT-4 model, and incurred a cost of about EUR 10 for the tokens consumed.

It was remarkable that the provided increments were executable at all times and that the small application was created step by step in sensible stages.

TimR - Simple Time Tracking

Throughout the manual testing phase, several issues were encountered, such as absent menu options and improperly implemented API endpoints. These issues were swiftly identified and analyzed by the designated agents, leading to corrections by the implementation agent. Subsequent tests confirmed the successful resolution of these errors.

TimR - Reports

The functionality and design of the features are currently basic; in particular, the reporting charts and error pages need enhancement. It would be interesting to explore how much could be improved through more refined requirements.

The repository, containing the generated code and documentation (with the sole exception of an update to the README file), is available on GitHub.

To set up and run the application, follow these steps:

Install the necessary dependencies.

npm install

Before you can start the application, you need a MongoDB instance, which you can simply start with Docker.

docker run --name mongodb -d -p 27017:27017 mongodb/mongodb-community-server:latest

The database URL is placed in the .env file.

DATABASE_URL=mongodb://localhost:27017/timr
SESSION_SECRET=123

Then, the application can be started.

npm run start

On the login screen, start by creating a new user account. Once the account is set up, you can log in to begin recording your initial entries.

Findings

It was impressive to see how the “team of agents” analyzed the problem, structured it and broke it down into small, manageable steps. The cooperation of the agents was efficient and goal-oriented, and the quality of the results was astonishingly high. Errors in the implementation were also quickly rectified following appropriate feedback.

At first glance, the created project meets the requirements and has a sensible structure. The provided documentation makes it easy to get started.

We would also like to see a test suite that can be used to validate the functionality and to verify changes and refactorings. This could probably be achieved by adding a test-automation agent to the team.

This experiment raises pivotal questions about the trajectory of software development. It prompts us to consider whether significant portions of the software development lifecycle could be managed autonomously by collaborating agents. These agents might independently navigate complex design and implementation challenges, consulting humans only for critical decisions or final approvals. Alternatively, it raises the question of whether humans will continue to lead, with agents serving as virtual assistants that enhance our productivity.

Naturally, this raises immediate concerns about accountability for the generated code and the sustainable development of a software project built with such tools. Moreover, blind reliance on a black-box software-engineering mechanism poses significant risks.

Challenges and future prospects

Despite the promising approaches, there are still challenges to overcome. The description of tasks must be precise and detailed in order to achieve optimal results. Unfortunately, AI still does not offer an automatic “do what I want” based on a scant requirement description. ;-)

In addition, the interaction and collaboration of the various agents is a complex process that needs to be developed further. One detail of the code generation that we observed in our experiment was that the code files were always completely regenerated, even if, for example, only details were to be added to a method.

Enhancing the incremental code generation process could significantly improve the stability of development across increments. The existing methodology poses risks such as overwriting previously implemented segments, the disappearance of features, or the introduction of errors into already tested components.
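One conceivable direction - a hypothetical sketch, not something GPT-Pilot does today - is to let the implementation agent return a small search-and-replace patch instead of the whole file, so that already reviewed parts remain untouched:

def apply_patch(source: str, patch: dict) -> str:
    # The agent returns only the section to be changed instead of regenerating the whole file.
    if patch["search"] not in source:
        raise ValueError("Patch does not match the current file - ask the agent for a new one.")
    return source.replace(patch["search"], patch["replace"], 1)

# Hypothetical patch produced by an implementation agent: only one small section changes,
# the rest of the already tested file stays exactly as it was.
patch = {
    "search": "const PORT = 3000;",
    "replace": "const PORT = process.env.PORT || 3000;",
}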

Current developments such as ever more powerful LLMs, enormous context lengths (over a million tokens), LLMs specialized in particular task areas and multimodal LLMs are opening up new possibilities for supporting software development.

Advanced agent frameworks make it significantly easier to create and deploy agents, manage their interaction and keep workflows stable. This opens the door to discovering which approaches will be successful and which new applications can emerge from these innovations.

In our attempto Lab, we engage deeply with these questions, building bespoke agents and workflows tailored to diverse scenarios. This allows us to thoroughly investigate the capabilities and limits of these emerging technologies.

Conclusion

The development of AI has come a long way, from its humble beginnings in the 1960s to today’s advanced models and agent-based workflows. The opportunities that arise from this are enormous and could fundamentally change the way we develop software.

The idea of solving complex problems through the collaboration of specialized agents promises higher performance and efficiency. The first experiments and projects are showing promising results and give an idea of the potential of this new way of working.

As things stand today, the available tools are still at an early stage and do not yet meet the requirements we place on high-quality software development. We will continue to monitor progress in this area and regularly test advanced tools.

At the same time, however, we must also face up to the social, ethical and legal issues which are already clearly emerging today. It is up to us to decide how we want to deal with the new opportunities as a society.
