Can AI Coding Systems Earn $1 Million As Freelancers?

OpenAI researchers challenged cutting-edge AI systems to tackle software development tasks that humans had been paid $1 million to solve. This is how they performed.

A developer working on AI software. (Credit: DC Studio/Shutterstock)


Freelance software engineering is a lucrative and dynamic field where skilled developers tackle diverse challenges, from bug fixes to full-stack feature development. In recent years, these workers have been among the first to incorporate AI systems into their workflow to help write code.

That raises an interesting question: could an AI system do the same job by itself? In other words, have software engineers effectively developed themselves out of their own jobs?

Now we get an answer of sorts thanks to the work of Samuel Miserendino, Michele Wang, and colleagues at OpenAI Research, who have developed a benchmarking tool that determines whether state-of-the-art large language models (LLMs) can complete a set of real software development tasks that have been solved by humans. These human developers earned themselves $1 million in the process, raising the obvious question of whether AI systems could earn their crust alone.

The answer will be of limited comfort to human developers. “Real-world freelance work in our benchmark remains challenging for frontier language models,” say Miserendino, Wang and co. Nevertheless, they calculate that the best models could successfully earn a significant fraction of the $1 million.

Code Red

Software engineering involves far more than just writing code. Engineers must interpret client requirements, navigate complex codebases and make high-level architectural decisions about the correct approach. Real-world freelance jobs require full-stack development, debugging and managerial skills.

Assessing the performance of large language models at these tasks is tricky because most benchmarks involve standard coding problems, which represent just a small part of the freelancer’s challenge.

Miserendino, Wang and co set out to change this by creating a database of real software engineering tasks previously solved by human freelancers. They call their benchmark SWE-Lancer and hope it will become a standard against which to test the real-world coding performance of advanced large language models.

The team sourced the freelance tasks from Expensify, a public company that owns an expense management system used by 12 million customers. This software requires constant maintenance and development, for which the company relies on freelance workers. Expensify makes these coding tasks public and posts them to the freelancer website Upwork.

The OpenAI team chose 1,488 of these tasks. About half of them were aimed at individual programmers and involved tasks like developing coding patches to resolve real-world issues. The other half of the tasks were for managers and involved selecting the best solution from competing proposals submitted by human freelancers.

All the tasks had been completed by human freelancers who were paid amounts varying from $250 to $32,000. The total value of all the tasks was $1 million.

To put current state-of-the-art AI models through their paces, the team gave each task to Anthropic’s Claude 3.5 Sonnet and OpenAI’s GPT-4o and o1 models. The AI systems were given the text describing the issue as it appeared on the Upwork platform, a snapshot of the code before the fix was made, and the objective in fixing the issue.

For the management tasks, the models were given various proposed solutions to a problem, a snapshot of the code to be fixed and the goal in picking the most suitable solution.
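Scoring these manager tasks can be pictured as a simple matching exercise. In the minimal sketch below (the proposal data and the exact scoring rule are illustrative assumptions, not the paper's published harness), the model's chosen proposal counts as correct when it matches the one the human manager actually selected:

```python
# Hypothetical manager-task scoring: the model picks one of several
# competing freelancer proposals, and the pick is judged correct if it
# matches the proposal the human manager actually chose.

def score_manager_task(model_choice: int, human_choice: int) -> bool:
    """Return True when the model selected the same proposal as the human."""
    return model_choice == human_choice

# Illustrative proposals for a hypothetical expense-parsing bug.
proposals = [
    "Patch the regex in the expense parser",
    "Rewrite the parser with a proper tokenizer",
    "Add a server-side validation hook",
]

print(score_manager_task(model_choice=1, human_choice=1))  # True
print(score_manager_task(model_choice=0, human_choice=2))  # False
```

Because each task has one human-validated answer, this kind of evaluation needs no subjective grading, which is part of what makes manager tasks easy to benchmark at scale.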

The results are illuminating. “Sonnet 3.5 performs best, followed by o1 and then GPT-4o,” say Miserendino, Wang and co. But they were far from perfect. “All models earn well below the full $1 million USD of possible payout on the full SWE-Lancer dataset,” say the researchers.

Nevertheless, there is a healthy return for some problems. “On the full SWE-Lancer dataset, Claude 3.5 Sonnet earns over $400,000 out of $1,000,000 possible.”
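The benchmark's headline metric is pay-weighted: a model "earns" a task's payout only if its submission passes that task's tests. A minimal sketch of that accounting, using hypothetical task records rather than the real dataset, looks like this:

```python
# Sketch of SWE-Lancer-style pay-weighted scoring. A model earns a task's
# dollar payout only when its solution passes the task's tests. The task
# records below are hypothetical, for illustration only.

tasks = [
    {"id": "bugfix-001", "payout": 250, "passed": True},
    {"id": "feature-042", "payout": 32_000, "passed": False},
    {"id": "manager-007", "payout": 1_000, "passed": True},
]

def earnings(tasks):
    """Sum the payouts of the tasks whose tests the model passed."""
    return sum(t["payout"] for t in tasks if t["passed"])

total = sum(t["payout"] for t in tasks)
earned = earnings(tasks)
print(f"Earned ${earned:,} of ${total:,} possible "
      f"({earned / total:.0%} of available payout)")
```

Weighting by payout rather than task count means a model that solves many cheap bug fixes but fails the expensive full-stack jobs scores far lower than a raw pass rate would suggest.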

That seems like a reasonable income for a freelance developer using AI to assist in their work. But there are clear limitations. The AI systems performed better on manager tasks than on individual coding tasks, where they often produced superficial fixes rather than addressing root problems. This suggests AI is better at evaluating solutions than at implementing them.

Overall, the AI systems were able to tackle less than 50 percent of the available tasks, which leads the team to a somber conclusion. “The real-world freelance work in our benchmark remains challenging for frontier language models,” say the researchers.

Money Making

The team say that LLMs’ inability to outperform human freelancers stems from several fundamental issues. For example, AI models lack a deep understanding of code; instead, they are merely pattern generators. Human engineers also iteratively refine their solutions, running tests and debugging unexpected behaviors, an approach that LLMs struggle to copy.

But while LLMs aren’t ready to replace human engineers, the SWE-Lancer benchmark reveals exciting potential. It suggests that AI assistants are likely to help automate routine coding tasks, so that human developers can focus on higher-level problem-solving.

One thing the researchers do not examine in detail is how long humans take to complete the tasks compared with machines. It may be that AI systems are not currently much better at some tasks but are significantly faster. That will inevitably feature in business planning.

But the results show that some tasks are ripe for automation, and some are probably already being handled this way by enterprising freelancers and businesses. This proportion is likely to increase as the models become more capable.

And judging by the improvements AI models have achieved on other benchmarks for advanced mathematics problems and the like, this improvement is likely to accelerate quickly.

Clearly, the time for disruptive change is now.


Ref: SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? arxiv.org/abs/2502.12115
