Finding Anagrams in a Dictionary with Hadoop

May 20th, 2011 by klose | Posted under mapreduce.

The first part of Programming Pearls presents a program for finding anagrams. But for a huge dictionary, a single machine may struggle to produce an answer in reasonable time, so I tried implementing the program on Hadoop.
Map phase: split the input into words and emit a key-value pair for each one, where the key is the word with its letters sorted alphabetically and the value is the original word. For example:
apple  —map—> <aelpp, apple>
Reduce phase: all <key, value> pairs that share a key are anagrams of one another, so we iterate over the values and add each one to a HashSet, recording duplicate words only once.
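The letter-sorting step that produces the key can be tried on its own. Here is a minimal sketch (the `sign` helper name is mine, not part of the program below):

 import java.util.Arrays;

 public class SignDemo {
     // Sort a word's letters; anagrams collapse to the same signature.
     static String sign(String word) {
         char[] a = word.toCharArray();
         Arrays.sort(a);
         return String.valueOf(a);
     }

     public static void main(String[] args) {
         System.out.println(sign("apple"));  // aelpp
         System.out.println(sign("layers")); // aelrsy
     }
 }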

 package cn.ict.dpg;

 import java.io.IOException;
 import java.util.Arrays;
 import java.util.HashSet;
 import java.util.Iterator;
 import java.util.StringTokenizer;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;

 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.GenericOptionsParser;

 class wordMapper extends Mapper<Object, Text, Text, Text> {

     @Override
     public void map(Object key, Text value, Context context)
             throws IOException, InterruptedException {
         // Every character in the delimiter string is a separate delimiter
         // (including the stray '%' characters). Note that '.' is not a
         // delimiter, so trailing periods stay attached to words; that is
         // why they appear in the sample output below.
         StringTokenizer st = new StringTokenizer(value.toString().trim(), " \t\n\r\f%;%,%-%'%~");
         String outputValue = "";
         String outputKey = "";
         while (st.hasMoreTokens()) {
             outputValue = st.nextToken().trim();
             if (outputValue.equals(""))
                 continue;
             // Sort the word's letters to build the anagram key.
             char[] a = outputValue.toCharArray();
             Arrays.sort(a);
             outputKey = String.valueOf(a);
             context.write(new Text(outputKey), new Text(outputValue));
         }
     }
 }

 class wordReducer extends Reducer<Text, Text, Text, Text> {

     @Override
     public void reduce(Text key, Iterable<Text> values, Context context)
             throws IOException, InterruptedException {
         // A HashSet keeps each word only once, discarding duplicates.
         HashSet<String> set = new HashSet<String>();
         Iterator<Text> iter = values.iterator();
         while (iter.hasNext()) {
             set.add(iter.next().toString());
         }
         context.write(key, new Text(set.toString()));
     }
 }

 public class wordExact {
     public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
         Configuration conf = new Configuration();
         String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
         if (otherArgs.length != 2) {
             System.err.println("Usage: wordexact <in> <out>");
             System.exit(2);
         }
         Job job = new Job(conf, "word exact");
         job.setJarByClass(wordExact.class);
         job.setMapperClass(wordMapper.class);
         // Reusing the reducer as the combiner means the reducer receives
         // values the combiner has already formatted as "[...]", which is why
         // the final output below is double-bracketed; it also means duplicates
         // are only removed within a single combiner run.
         job.setCombinerClass(wordReducer.class);
         job.setReducerClass(wordReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(Text.class);
         FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
         FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }
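
To package the three classes into /tmp/wordExact.jar, something like the following works (a sketch, assuming a 0.20-era Hadoop with a single hadoop-core jar; adjust the jar name and paths to your installation):

 $mkdir classes
 $javac -classpath $HADOOP_HOME/hadoop-core-0.20.2.jar -d classes wordExact.java
 $jar cfe /tmp/wordExact.jar cn.ict.dpg.wordExact -C classes .

The e flag records cn.ict.dpg.wordExact as the jar's Main-Class, which is what allows the shorter hadoop jar invocation below.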

Running the program:
The input data file is in /tmp/input.
$bin/hadoop fs -put /tmp/input input
$bin/hadoop jar /tmp/wordExact.jar input output
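
The two-argument form above relies on the jar's manifest naming a main class; if the jar was built without a Main-Class entry, the driver class has to be passed explicitly:

 $bin/hadoop jar /tmp/wordExact.jar cn.ict.dpg.wordExact input output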
Viewing the results:

.aelrsy    [[layers.]]
.aelry     [[layer., early.]]
.aemn    [[name.]]
.aemnnr    [[manner.]]
.aemnory    [[anymore.]]
.aemns    [[names.]]
.aemrsst    [[streams.]]

The doubled brackets are a side effect of using the reducer as the combiner: the reducer receives values the combiner has already rendered as [...] and wraps them once more. As for the anagrams themselves, it turns out genuine anagram groups are quite rare.

A note on Hadoop configuration: while running MapReduce tasks, the most likely failure is java.lang.OutOfMemoryError: Java heap space, usually reported by org.apache.hadoop.mapred.Child: Error running child. To fix it, add the following to mapred-site.xml:

 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx1024m</value>
   <description>Java opts for the task tracker child processes.
   The following symbol, if present, will be interpolated: @taskid@ is replaced
   by current TaskID. Any other occurrences of '@' will go unchanged.
   For example, to enable verbose gc logging to a file named for the taskid in
   /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
         -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc

   The configuration variable mapred.child.ulimit can be used to control the
   maximum virtual memory of the child processes.
   </description>
 </property>

The setting above raises the child JVM heap size to 1 GB; remember to adjust Hadoop's configuration parameters to fit your own cluster.
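Because the driver passes its arguments through GenericOptionsParser, the same setting can also be supplied per job with a generic -D option instead of editing mapred-site.xml:

 $bin/hadoop jar /tmp/wordExact.jar -D mapred.child.java.opts=-Xmx1024m input output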

Done.
