Increase speed of composition from list and map: answers

【Question title】: Increase speed of composition from list and map
【Posted】: 2015-05-11 21:56:14
【Question description】:

I use a Dico class to store the weight of a term and the id of the document in which it appears:

public class Dico 
{
   private String m_term; // term
   private double m_weight; // weight of term
   private int m_Id_doc; // id of doc that contain term

   public Dico(int Id_Doc,String Term,double tf_ief ) 
   {
      this.m_Id_doc = Id_Doc;
      this.m_term = Term;
      this.m_weight = tf_ief;
   }
   public String getTerm()
   {
      return this.m_term;
   }

   public double getWeight()
   {
     return this.m_weight;
   }

   public void setWeight(double weight)
   {
     this.m_weight= weight;
   }

   public int getDocId()
   {
     return this.m_Id_doc;
   }                
}

I use this method to compute the final weight from a Map<String,Double> and a List<Dico>:

 public List<Dico> merge_list_map(List<Dico> list,Map<String,Double> map)
 {
    // in map each term is unique but in list i have redundancy

   List<Dico> list_term_weight = new ArrayList <>();

   for (Map.Entry<String,Double> entrySet : map.entrySet())
   {
       String key = entrySet.getKey();
       Double value = entrySet.getValue();

       for(Dico dic : list)
       {    
          String term =dic.getTerm();
          double weight = dic.getWeight();

          if(key.equals(term))
          {
             double new_weight =weight*value;                
             list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));
          }                  
       } 
    }
    return list_term_weight;
 }

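I have 36736 elements in the map and 1053914 in the list. Currently this program takes a lot of time to compile: BUILD SUCCESSFUL (total time: 17 minutes 15 seconds).

How can I get only the terms from the list that equal a term in the map?

【Question discussion】:

Use two maps, rather than one map and one list.
How do you initialize the map? Does it contain all the terms from the list, or only a subset?
You are talking about compile time and BUILD SUCCESSFUL, even though your problem is clearly a run-time problem. Can you confirm?
Yes, because the next step is to use the terms to classify nodes with a SOM neural network.

Tags: java optimization arraylist hashmap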

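【Solution 1】:

You can use the lookup functionality of the Map, i.e. Map.get(), given that your map maps terms to weights. This should give a significant performance improvement. The only difference is that the output list is in the same order as the input list, rather than in the order in which the keys occur in the weight map.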

public List<Dico> merge_list_map(List<Dico> list, Map<String, Double> map)
{
    // in map each term is unique but in list i have redundancy
    List<Dico> list_term_weight = new ArrayList<>();

    for (Dico dic : list)
    {
        String term = dic.getTerm();
        double weight = dic.getWeight();

        Double value = map.get(term);  // <== fetch weight from Map
        if (value != null)
        {
            double new_weight = weight * value;

            list_term_weight.add(new Dico(dic.getDocId(), term, new_weight));

        }
    }
    return list_term_weight;
}

Basic test

List<Dico> list = Arrays.asList(new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "baz", 3));
Map<String, Double> weights = new HashMap<String, Double>();
weights.put("foo", 2d);
weights.put("bar", 3d);
System.out.println(merge_list_map(list, weights));

Output

[Dico [m_term=foo, m_weight=2.0, m_Id_doc=1], Dico [m_term=bar, m_weight=6.0, m_Id_doc=2]]

Timing test - 10,000 elements

List<Dico> list = new ArrayList<Dico>();
Map<String, Double> weights = new HashMap<String, Double>();
for (int i = 0; i < 1e4; i++) {
    list.add(new Dico(i, "foo-" + i, i));
    if (i % 3 == 0) {
        weights.put("foo-" + i, (double) i);  // <== every 3rd has a weight
    }
}

long t0 = System.currentTimeMillis();
List<Dico> result1 = merge_list_map_original(list, weights);
long t1 = System.currentTimeMillis();
List<Dico> result2 = merge_list_map_fast(list, weights);
long t2 = System.currentTimeMillis();

System.out.println(String.format("Original: %d ms", t1 - t0));
System.out.println(String.format("Fast:     %d ms", t2 - t1));

// prove results equivalent, just different order
// requires Dico class to have hashCode/equals() - used eclipse default generator
System.out.println(new HashSet<Dico>(result1).equals(new HashSet<Dico>(result2)));

Output

Original: 1005 ms
Fast:     16 ms  <=== loads quicker
true
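
The printed output and the HashSet comparison above rely on Dico having toString(), equals() and hashCode(), none of which the class in the question defines. The answer only says they came from the Eclipse generator, so the bodies below are an assumption: a minimal sketch in which all three fields take part in equality and toString() reproduces the format shown in the output.

```java
import java.util.Objects;

// Sketch: the question's Dico class plus the three methods the tests depend on.
// The method bodies are assumed, not taken from the original answer.
class Dico {
    private final String m_term;   // term
    private double m_weight;       // weight of term
    private final int m_Id_doc;    // id of the doc that contains the term

    Dico(int idDoc, String term, double weight) {
        this.m_Id_doc = idDoc;
        this.m_term = term;
        this.m_weight = weight;
    }

    String getTerm() { return m_term; }
    double getWeight() { return m_weight; }
    void setWeight(double weight) { this.m_weight = weight; }
    int getDocId() { return m_Id_doc; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Dico)) return false;
        Dico other = (Dico) o;
        return m_Id_doc == other.m_Id_doc
                && Double.compare(m_weight, other.m_weight) == 0
                && Objects.equals(m_term, other.m_term);
    }

    @Override
    public int hashCode() {
        // Must use the same fields as equals() so equal objects hash alike
        return Objects.hash(m_term, m_weight, m_Id_doc);
    }

    @Override
    public String toString() {
        return "Dico [m_term=" + m_term + ", m_weight=" + m_weight
                + ", m_Id_doc=" + m_Id_doc + "]";
    }

    public static void main(String[] args) {
        System.out.println(new Dico(1, "foo", 2.0));
    }
}
```

Because m_weight is mutable and part of hashCode(), calling setWeight() on an instance that is already inside a HashSet would make it unfindable, so this choice only suits the read-only comparison done in the test.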

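【Discussion】:

Thanks for your help, but I have to compare each map term with all the terms and update the weight when map.term = dic.term.
I'm a bit confused about the expected input and output. Can you update the question with a small working example? I believe what I've done does the same thing, as my tests show.
OK, your solution is very useful; I was confused. Thanks.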

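【Solution 2】:

Also, check the initialization of the map (http://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html). Rehashing a map is costly in terms of performance.

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.

If you know, or can approximate, the number of elements you will put in the map, you can create the map like this: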

Map<String, Double> foo = new HashMap<String, Double>(maxSize * 2);

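In my experience, this can improve performance by a factor of 2 or more.

【Discussion】:

maxSize * 2 is probably overkill in terms of size, but it would reduce collisions and give better performance. If you know the exact maximum size, then with the default load factor the minimum capacity required is maxSize * 4 / 3 + 1.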
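
The sizing rule from the comment above can be sketched as a small helper. The class and method names (PresizedMap, capacityFor, newMap) are invented for illustration; only the formula maxSize * 4 / 3 + 1 and the HashMap(int) constructor come from the thread and the JDK.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: pre-size a HashMap so that inserting `expected`
// entries never triggers a rehash at the default load factor of 0.75.
class PresizedMap {

    // Smallest capacity strictly greater than expected / 0.75.
    // Computed in long arithmetic so very large inputs do not overflow.
    static int capacityFor(int expected) {
        long capacity = expected * 4L / 3 + 1;
        return (int) Math.min(Integer.MAX_VALUE, capacity);
    }

    static <K, V> Map<K, V> newMap(int expected) {
        return new HashMap<>(capacityFor(expected));
    }

    public static void main(String[] args) {
        // 36736 is the map size quoted in the question
        Map<String, Double> weights = newMap(36736);
        weights.put("foo", 2.0);
        System.out.println(capacityFor(36736));
    }
}
```

HashMap rounds the requested capacity up to the next power of two internally, so the table actually allocated may be larger than the value computed here.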

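【Solution 3】:

To make the merge_list_map function efficient, you need to use the Map for what it actually is: an efficient data structure for key lookup. The way you do it, looping over the Map entries and searching the List for matches, the algorithm is O(N*M), where M is the size of the map and N is the size of the list. That is certainly the worst you can get.

If you instead iterate over the List first and, for each term, do a lookup in the Map with Map#get(Object key), you get a time complexity of O(N), since a map lookup can be considered O(1).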
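
To put the two complexity figures in perspective for the sizes quoted in the question (36736 map entries, 1053914 list entries), here is a small sketch that only does the arithmetic. The class and method names are made up for illustration, and the counts are operation counts, not timings.

```java
// Hypothetical illustration: how many basic operations each approach performs.
class ComparisonCount {

    // Nested loops: every map key is compared with every list element once.
    static long nestedLoopComparisons(long mapSize, long listSize) {
        return mapSize * listSize;
    }

    // List-first approach: one Map.get() call per list element.
    static long lookupCount(long listSize) {
        return listSize;
    }

    public static void main(String[] args) {
        long mapSize = 36736;     // M, from the question
        long listSize = 1053914;  // N, from the question
        System.out.println("String.equals calls (nested loops): "
                + nestedLoopComparisons(mapSize, listSize));
        System.out.println("Map.get calls (lookup): " + lookupCount(listSize));
    }
}
```

With these sizes the nested loops perform roughly 3.9 * 10^10 string comparisons, against about 10^6 hash lookups for the list-first version.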

Design-wise, if you can use Java 8, your problem can be expressed in terms of Streams:

public static List<Dico> merge_list_map(List<Dico> dico, Map<String, Double> weights) {
    List<Dico> wDico = dico.stream()
            .filter  (d -> weights.containsKey(d.getTerm()))
            .map     (d -> new Dico(d.getDocId(), d.getTerm(), d.getWeight() * weights.get(d.getTerm())))
            .collect (Collectors.toList());
    return wDico;
}

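The new weighted list is built following a logical process:

stream(): take the list as a stream of Dico elements
filter(): keep only the Dico elements whose term is in the weights map
map(): for each filtered element, create a new Dico() instance with the computed weight
collect(): collect all the new instances into a new list
return the new list containing the filtered Dico elements with their new weights.

Performance-wise, I tested it against some text, The Narrative of Arthur Gordon Pym by E. A. Poe: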

String text = null;
try (InputStream url = new URL("http://www.gutenberg.org/files/2149/2149-h/2149-h.htm").openStream())  {
    text = new Scanner(url, "UTF-8").useDelimiter("\\A").next();    
}
String[] words = text.split("[\\p{Punct}\\s]+");
System.out.println(words.length); // => 108028

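Since there are only about 100k words in the book, let's multiply the list by 10 for good measure (initDico() is a helper that builds the List<Dico> from the words):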

List<Dico> dico = initDico(words);
List<Dico> bigDico = new ArrayList<>(10*dico.size());
for (int i = 0; i < 10; i++) {
    bigDico.addAll(dico);
}
System.out.println(bigDico.size()); // 1080280

Build the weight map using all the words (initWeights() builds a frequency map of the words in the book):

Map<String, Double> weights = initWeights(words);
System.out.println(weights.size()); // 9449 distinct words

The test of merging the 1M words against the weight map:

long start = System.currentTimeMillis();
List<Dico> wDico = merge_list_map(bigDico, weights);
long end = System.currentTimeMillis();
System.out.println("===== Elapsed time (ms): "+(end-start)); 
// => 105 ms

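The weight map is significantly smaller than yours, but that should not affect the timing, since lookup operations run in quasi-constant time.

This is not a rigorous benchmark of the function, but it already shows that merge_list_map() should take less than 1 s (loading and building the list and the map are not part of the function).

To complete the exercise, here are the initialization methods used in the test above: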

private static List<Dico> initDico(String[] terms) {
    List<Dico> dico = Arrays.stream(terms)
            .map(String::toLowerCase)
            .map(s -> new Dico(0, s, 1.0)) // doc id is not used in this test; 0 is a placeholder
            .collect(Collectors.toList());
    return dico;
}

// weight of a word is the frequency*1000
private static Map<String, Double> initWeights(String[] terms) {
    Map<String, Long> wfreq = termFreq(terms);
    long total = wfreq.values().stream().reduce(0L, Long::sum);
    return wfreq.entrySet().stream()
            .collect(Collectors.toMap(Map.Entry::getKey, e -> (double)(1000.0*e.getValue()/total)));
}

private static Map<String, Long> termFreq(String[] terms) {
    Map<String, Long> wfreq = Arrays.stream(terms)
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    return wfreq;
}

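【Discussion】:

【Solution 4】:

You should use the contains() method on the list. That way you avoid the second for loop. Even though contains() has O(n) complexity, you should see a small improvement. Of course, remember to re-implement equals(). Otherwise you should use a second Map, as bot suggested.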
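
The "second Map" alternative mentioned above could look roughly like the sketch below: index the list by term once, then walk the weight map as the original code does. The nested Dico here is a minimal stand-in for the question's class, and the class and method names (TwoMapMerge, merge) are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the two-map approach.
class TwoMapMerge {

    // Minimal stand-in for the question's Dico class
    static final class Dico {
        final int docId;
        final String term;
        final double weight;

        Dico(int docId, String term, double weight) {
            this.docId = docId;
            this.term = term;
            this.weight = weight;
        }
    }

    static List<Dico> merge(List<Dico> list, Map<String, Double> weights) {
        // Second map: term -> every list entry carrying that term (one pass over the list)
        Map<String, List<Dico>> byTerm =
                list.stream().collect(Collectors.groupingBy(d -> d.term));

        List<Dico> result = new ArrayList<>();
        // Iterate over the weight map, so output follows the map's key order
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            List<Dico> matches = byTerm.get(e.getKey());
            if (matches == null) {
                continue; // term has a weight but never occurs in the list
            }
            for (Dico d : matches) {
                result.add(new Dico(d.docId, d.term, d.weight * e.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Dico> list = Arrays.asList(
                new Dico(1, "foo", 1), new Dico(2, "bar", 2), new Dico(3, "foo", 3));
        Map<String, Double> weights = new HashMap<>();
        weights.put("foo", 2d);
        weights.put("bar", 3d);
        System.out.println(merge(list, weights).size());
    }
}
```

Compared with looking each list element up in the weight map directly, this costs an extra map and extra memory. Its main appeal is that it keeps results grouped by the weight map's key order, which the original nested loops also did.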

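【Discussion】:

【Solution 5】:

Use the lookup functionality of the Map, as Adam pointed out, and use HashMap as the Map implementation: HashMap lookup complexity is O(1). This should improve performance.

【Discussion】: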