1 Billion Token Law
Last updated
With the rapid development of Artificial Intelligence (AI) technology, we are at a historic turning point where AI capabilities are growing at an unprecedented rate, reshaping the way we interact with the world.
Behind all of this is a thirst for big data: AI progress is often limited by the volume of data available for training. This leads us to a simple yet profound principle: the "Billion Token Law". The law is inspired by Malcolm Gladwell's "10,000 Hour Rule", proposed in his book "Outliers: The Story of Success", which suggests that achieving expertise requires long hours of deliberate practice.
Similarly, in AI training we observe that massive amounts of high-quality data are crucial to cultivating models capable of deeply understanding and executing complex tasks. Here, the "Billion Token Law" is not just a data standard but a guiding principle: supply AI models with enough knowledge to enable deep understanding and learning.
In the paper "Scaling Laws for Neural Language Models," we see direct evidence of the power-law relationship between model performance and data scale. From small-scale experiments to scaling up to massive datasets, these relationships provide valuable guidance on how to effectively advance AI technology. This chapter aims to delve deeper into the "Billion Token Law", analyze its impact on AI training, and explore how this grand goal can be achieved through the power of communities and decentralized technologies.
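The data-scaling relationship mentioned above can be sketched numerically. The snippet below is a hypothetical illustration of the data-limited power law from "Scaling Laws for Neural Language Models"; the constants are approximate values reported in that paper and should be treated as illustrative, not as exact fits for any particular model.

```python
# Data-scaling power law (Kaplan et al., 2020), data-limited regime:
#   L(D) = (D_c / D) ** alpha_D
# ALPHA_D and D_C below are approximate values from the paper.

ALPHA_D = 0.095   # exponent of the data-scaling term (approximate)
D_C = 5.4e13      # critical data size in tokens (approximate)

def loss_from_data(d_tokens: float) -> float:
    """Predicted cross-entropy loss when training data is the bottleneck."""
    return (D_C / d_tokens) ** ALPHA_D

# Each doubling of the dataset shrinks the predicted loss by a constant
# factor of 2 ** -ALPHA_D, i.e. roughly 6-7% per doubling.
for d in (1e9, 1e10, 1e11):
    print(f"{d:.0e} tokens -> predicted loss {loss_from_data(d):.3f}")
```

The key property is the power law's constant multiplicative return: every doubling of data buys the same fractional improvement, which is why progress demands ever-larger datasets.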
According to a report by the University of California, San Diego, "Amount of Information Consumed by the Average Person", the average American consumes about 34 gigabytes of data and information every day. That is estimated to be the equivalent of 100,000 words heard or read per day, roughly the length of J.R.R. Tolkien's The Hobbit (95,356 words).
Human learning is limited by physical and cognitive constraints: counting all visual, auditory, and other sensory signals, the amount of information an average person can take in per day is bounded. AI models, by contrast, can learn in parallel; their learning bandwidth is theoretically unbounded, allowing them to absorb vast amounts of information and knowledge in a very short time. The Billion Token Law is therefore both a call to action and a concrete path toward AGI. Everyone can contribute to the capabilities and intelligence of AI agents by providing effective training data, and given AI's near-limitless learning capacity, valuable training data will remain a scarce asset. In our vision, this call to action begins with decentralizing the recording of everyone's contributions to AGI from day one, so that the value created by AI can be allocated effectively.
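The scale gap can be made concrete with a rough calculation. Using the ~100,000 words per day cited above, and treating tokens and words as roughly 1:1 (a simplifying assumption; tokenizers typically produce somewhat more tokens than words), the sketch below estimates how long a human would need to take in one billion tokens:

```python
# Back-of-the-envelope sketch with the figures cited above.
# Assumption: one token ~ one word, purely for illustration.

WORDS_PER_DAY = 100_000          # average daily intake cited above
TARGET_TOKENS = 1_000_000_000    # one billion tokens

days = TARGET_TOKENS / WORDS_PER_DAY
years = days / 365
print(f"{days:,.0f} days, or about {years:.1f} years of nonstop intake")
```

On these assumptions, a billion tokens corresponds to well over two decades of continuous human reading, while a parallelized training run can consume it in hours.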