VOA Science and Technology 2024: Will AI Systems Run Out of Publicly Available Data on the Internet?

Date: 2024-08-12 08:23:04

A research group says artificial intelligence (AI) companies could run out of publicly available data for their systems in less than eight years.

Training data includes writing and information publicly available on the Internet. AI companies use the internet to "train" AI systems to create human-sounding writing. This "training" is what developers use to create large language models. Currently, many technology companies are developing large language models this way.

The nonprofit research group Epoch AI examines issues relating to AI. It has been following the development of large language models for a few years. In a recent paper, the group said technology companies will exhaust the supply of publicly available training data for AI language models between 2026 and 2032.

The team's latest paper has been reviewed by experts, or peer reviewed. It is to be presented at the International Conference on Machine Learning in Vienna, Austria, this summer. Epoch AI is linked to the research group Rethink Priorities based in San Francisco, California.

A 'gold rush'

Researcher Tamay Besiroglu is one of the paper's writers. He compared the current situation to a "gold rush" in which limited resources are depleted. He said the field of AI might face problems as the current speed of development uses up the current supply of human writing.

As a result, technology companies like OpenAI, the maker of ChatGPT, and Google are seeking to pay for high-quality data. Their goal is to ensure a flow of good material to train their systems. OpenAI has made deals with social media service Reddit and news provider News Corp. to use their material. The researchers consider this a short-term answer.

Over the long term, the group said, there will not be enough new blogs, news stories or social media writing to support the speed of AI development. That could lead companies to seek online data considered private, such as email and phone communications. They also might increasingly use AI-created data, such as chatbot content.

A 'bottleneck' in development?

Besiroglu described the issue as a "bottleneck" that can prevent companies from making improvements to their AI models, a process called "scaling up."

"...Scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output."

The Epoch AI group first made their predictions two years ago. That was weeks before the release of ChatGPT. At the time, the group said "high-quality language data" would be exhausted by 2026. Since then, AI researchers have developed new methods that make better use of data and that "overtrain" models on the same data many times. But there are limits to such methods.

While the amount of written information that is fed into AI systems has been growing, so has computing power, Epoch AI said. The parent company of Facebook, Meta Platforms, recently said the latest version of its Llama 3 model was trained on up to 15 trillion word pieces called tokens.

But whether a "bottleneck" in development is a concern remains the subject of debate.

Nicolas Papernot teaches computer engineering at the University of Toronto. He was not involved in the Epoch study. He said building more skilled AI systems can come from training them for specialized tasks. Papernot said he is concerned that training AI systems on AI-produced writing could lead to a situation known as "model collapse."

Permission and quality

Also, internet-based services such as Reddit and the information service Wikipedia are considering how they are being used by AI models. Wikipedia has placed few restrictions on how AI companies use its articles, which are written by volunteers.

But professional writers are worried about their protected materials. Last fall, 17 writers brought a legal action against OpenAI for what they called "systematic theft on a mass scale." They said ChatGPT was using their materials, which are protected by copyright laws, without permission.

AI developers are concerned about the quality of what they train their systems on. Epoch AI's study noted that paying millions of humans to write for AI models "is unlikely to be an economical way" to improve performance.

The chief of OpenAI, Sam Altman, told a group at a United Nations event last month that his company has experimented with "generating lots of synthetic data" for training. He said both humans and machines produce high- and low-quality data.

Altman expressed concerns, however, about depending too heavily on synthetic data over other technical methods to improve AI models.

"There'd be something very strange if the best way to train a model was to just generate...synthetic data and feed that back in," Altman said. "Somehow that seems inefficient."

Words in This Story

exhaust -v. to completely use up a resource

depleted -adj. when a resource is almost used up

trajectory -n. the direction that something is taking or is predicted to take

synthetic -adj. created by a process that is not natural

scale -n. the level or size of a thing

generate -v. to create something through a process
