STRAND (Structural Translation Recognition for Acquiring Natural Data) is a system for automatically acquiring pairs of documents in parallel translation on the World Wide Web. It is highly accurate: evaluation of the most recent version of the system suggests that, on average, at most 1 in 20 of the document pairs it identifies as translations are not in fact translations (Resnik and Smith, 2002).
We would very much like to simply distribute parallel corpora acquired by STRAND, but because material on the Web is subject to copyright restrictions, unfortunately this is not possible. Instead, this page provides the next best thing: databases of URL pairs acquired by STRAND, which you can download yourself for personal use.
For the older databases (prior to July 2002), it is likely that more and more of the pages will have become unavailable, or that the underlying content will have changed. Use at your own risk! Beginning with the July 2002 English-Arabic bilingual database, however, we are generally providing persistent URLs: they point to pages at the Internet Archive, which provides permanently retrievable time-stamped snapshots of Web pages.
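As an illustration of how such persistent URLs relate to the original Web addresses, here is a minimal Python sketch. It assumes the archived URLs follow the Wayback Machine's usual layout, http://web.archive.org/web/<timestamp>/<original URL>; the function name `split_archive_url` is hypothetical, and the exact form of the entries in any given STRAND database should be checked against the downloaded file.

```python
import re
from typing import Optional, Tuple

# Assumed layout of an Internet Archive snapshot URL:
#   http://web.archive.org/web/<timestamp>/<original URL>
# The timestamp is a run of digits (up to YYYYMMDDhhmmss), optionally
# followed by a short modifier such as "id_".
_WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/(\d{4,14})[a-z_]*/(.+)$"
)


def split_archive_url(url: str) -> Optional[Tuple[str, str]]:
    """Split a snapshot URL into (timestamp, original URL).

    Returns None when the URL does not match the assumed layout,
    for example when it is already a plain Web URL.
    """
    match = _WAYBACK.match(url)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    example = "http://web.archive.org/web/20030315000000/http://example.com/a.html"
    print(split_archive_url(example))
```

Returning None for non-matching input lets the same function be run over a mixed list of archived and original URLs without raising an error.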
You are welcome to use any of the STRAND Bilingual Databases for any purpose, research or commercial, so long as you visibly acknowledge its use in any product or marketing literature and cite the following in any publications:
Philip Resnik and Noah A. Smith, The Web as a parallel corpus, Computational Linguistics, Volume 29, Issue 3 (September 2003), pages 349-380.
That article gives technical details and performance assessments, and includes a discussion of related work on bitext mining by other researchers. It is a revised/extended version of
NO WARRANTY WE PROVIDE ABSOLUTELY NO WARRANTY OF ANY KIND EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM OR DATABASE IS WITH YOU. SHOULD THIS PROGRAM OR DATABASE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. IN NO EVENT SHALL THE AUTHOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST MONIES, OR OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA OR ITS ANALYSIS BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD PARTIES) THE PROGRAM OR DATABASE. (Above NO WARRANTY modified from the GNU NO WARRANTY statement.)
Please report any bugs or problems to Philip Resnik, resnik@umiacs.umd.edu.
These data were collected via a search of several crawls on the Internet Archive in March 2003; details are discussed in Resnik and Smith (2003), cited above.
For original Web URLs instead of Internet Archive URLs, download strand_enzh_0303_web.db.gz instead.
These data were collected via a search of roughly one sixth of the Internet Archive as of January 2001; details are presented in Resnik and Smith (2002). Formal evaluation on this set yields an estimated precision of 95% and an estimated recall of 99%.
These data were collected by David Martinez (jibmaird@si.ehu.es) and colleagues from the IXA group in the Basque Country, using their implementation of the STRAND approach. The URL pairs listed have not been manually checked.
These data were collected by Jinxi Xu of BBN (jxu@bbn.com) using a modified version of STRAND. Performance based on formal evaluation at UMD is estimated at 98% precision and 61% recall.
This database is described in the ACL'99 paper cited above.
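To make the precision and recall estimates quoted for the databases above more concrete, the following Python sketch converts them into expected counts. The database size of 1000 pairs is purely hypothetical (no sizes are stated here), and the function names are illustrative rather than part of STRAND.

```python
def expected_errors(num_pairs: int, precision: float) -> float:
    """Expected number of listed pairs that are NOT true translations."""
    return num_pairs * (1.0 - precision)


def estimated_true_total(num_pairs: int, precision: float, recall: float) -> float:
    """Estimated number of true translation pairs among the candidates examined.

    Correct pairs found = num_pairs * precision, and recall is the
    fraction of all true pairs that were found.
    """
    return num_pairs * precision / recall


if __name__ == "__main__":
    n = 1000  # hypothetical database size, for illustration only
    for label, p, r in [("95% / 99%", 0.95, 0.99), ("98% / 61%", 0.98, 0.61)]:
        print(
            f"precision/recall {label}: "
            f"about {expected_errors(n, p):.0f} wrong pairs per {n}, "
            f"about {estimated_true_total(n, p, r):.0f} true pairs available"
        )
```

A precision of 95% is the same statement as "1 in 20 pairs may be wrong", while a lower recall means only that more true pairs were missed, not that the listed pairs are less reliable.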