All Our N-gram are Belong to You
Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation , speech recognition,spelling correction , entity detection, information extraction, and others. While such models have usually been estimated from training corpora containing at most a few billion words, we have been harnessing the vast power of Google's datacenters and distributed processing infrastructure to process larger and larger training corpora. We found that there's no data like more data, and scaled up the size of our data by one order of magnitude, and then another, and then one more - resulting in a training corpus of one trillion words from public Web pages.
We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,024,908,267,229 words of running text and are publishing the counts for all 1,176,470,663 five-word sequences that appear at least 40 times. There are 13,588,391 unique words, after discarding words that appear less than 200 times.
Watch for an announcement at the Linguistics Data Consortium (LDC ), who will be distributing it soon, and then order your set of 6 DVDs. And let us hear from you - we're excited to hear what you will do with the data, and we're always interested in feedback about this dataset, or other potential datasets that might be useful for the research community.
Update (22 Sept. 2006): The LDC now has the data available in their catalog.The counts are as follows:
File sizes: approx. 24 GB compressed (gzip'ed) text files Number of tokens: 1,024,908,267,229 Number of sentences: 95,119,665,584 Number of unigrams: 13,588,391 Number of bigrams: 314,843,401 Number of trigrams: 977,069,902 Number of fourgrams: 1,313,818,354 Number of fivegrams: 1,176,470,663
The following is an example of the 3-gram data contained this corpus:
ceramics collectables collectibles 55 ceramics collectables fine 130 ceramics collected by 52 ceramics collectible pottery 50 ceramics collectibles cooking 45 ceramics collection , 144 ceramics collection . 247 ceramics collection120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections , 66
ceramics collections . 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community , 109
ceramics community . 212
ceramics community for 61
ceramics companies . 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company , 133
ceramics company . 92
ceramics company 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
ceramics composites ferrites 56
ceramics composition as 41
ceramics computer graphics 51
ceramics computer imaging 52
ceramics consist of 92
The following is an example of the 4-gram data in this corpus:
serve as the incoming 92 serve as the incubator 99 serve as the independent 794 serve as the index 223 serve as the indication 72 serve as the indicator 120 serve as the indicators 45 serve as the indispensable 111 serve as the indispensible 40 serve as the individual 234 serve as the industrial 52 serve as the industry 607 serve as the info 42 serve as the informal 102 serve as the information 838 serve as the informational 41 serve as the infrastructure 500 serve as the initial 5331 serve as the initiating 125 serve as the initiation 63 serve as the initiator 81 serve as the injector 56 serve as the inlet 41 serve as the inner 87 serve as the input 1323 serve as the inputs 189 serve as the insertion 49 serve as the insourced 67 serve as the inspection 43 serve as the inspector 66 serve as the inspiration 1390 serve as the installation 136 serve as the institute 187 serve as the institution 279 serve as the institutional 461 serve as the instructional 173 serve as the instructor 286 serve as the instructors 161 serve as the instrument 614 serve as the instruments 193 serve as the insurance 52 serve as the insurer 82 serve as the intake 70 serve as the integral 68
No comments:
Post a Comment