Find duplicate words in GNU make (answers)

【Question title】: Find duplicate words in GNU make
【Posted】: 2017-05-23 21:22:05
【Question】:

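I am looking for a good way to find all words that occur more than once in a string. There are some constraints:

- Sub-quadratic speed is required: with about 1000 words, I can afford a few milliseconds.
- It must be implemented in pure make:
  - I want to avoid $(shell), because it is expensive and the solution must run on Windows (on pure Linux, sort | uniq -d would solve my problem nicely).
  - No Guile, because I cannot control which make version is used and need to stay compatible with old make versions (3.81).
- The implementation should be acceptably readable.

In addition, the number of duplicates will be small, and the words will only contain nice characters, e.g. [-_+a-zA-Z0-9]+.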

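I have tried two strategies:

(1) Force $(sort) to keep duplicates (append a unique suffix to each word, sort, and strip the suffixes), then find adjacent identical words in the sorted list: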

# given 0 1 0 1 0 1 0 1 ... , return 0 0 1 1 0 0 1 1 ...
double=$(wordlist 1,$(words $(1)),$(subst 0,0 0,$(subst 1,1 1,$(1))))
# Produce a list of N unique strings. $(1) contains N words, with a
# repetition cycle of length M, and $(2) contains N words, either 0 or
# 1, alternating between 0 and 1 every Mth word.
binseq=$(if $(findstring 1,$(2)),$(call binseq,$(join $(2),$(1)),$(call double,$(2))),$(1))
# return 0 1 0 1 ..., as many words as $(1)
alternating_bits=$(wordlist 1,$(words $(1)),$(patsubst %,0 1,$(1)))
# Produce as many unique words as there are words in $(1)
unique=$(call binseq,,$(call alternating_bits,$(1)))
# Sort $(1) without eliminating duplicates. $(1) may not contain /.
sorted_keep_dups=$(subst /,,$(dir $(sort $(join $(1:=/),$(call unique,$(1))))))

dups_from_sorted2=$(filter $(patsubst %0,%,$(filter %0,$(1))),$(patsubst %1,%,$(filter %1,$(1))))
# Given a sorted list, return all duplicates.
dups_from_sorted=$(sort $(call dups_from_sorted2,$(join $(1),$(call alternating_bits,$(1)))))

dups=$(call dups_from_sorted,$(call sorted_keep_dups,$(1)))
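To see why the suffix trick is needed, here is a minimal sketch. The variable names list and tagged and the hand-written tags 0 1 2 are illustrative assumptions, not part of the code above. $(sort) on its own collapses duplicates, while sorting uniquely tagged words keeps them:

```make
# Hypothetical three-word input with one duplicate.
list := b a b

# $(sort) removes duplicates, so this prints: a b
$(info $(sort $(list)))

# Tag each word with a unique suffix after a "/" separator, sort, then
# strip the tags again with $(dir) and $(subst). The tags 0 1 2 are
# written by hand here; the unique function above generates them.
tagged := $(join $(addsuffix /,$(list)),0 1 2)

# Prints: a b b
$(info $(subst /,,$(dir $(sort $(tagged)))))

.PHONY: all
all: ;
```

Because every tagged word starts with its original word followed by "/", equal words stay next to each other after sorting, which is what dups_from_sorted relies on.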

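(2) Apply $(filter) repeatedly to different partitions of the word list, such that every pair of words ends up in different arguments of $(filter) at least once: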

# given 0 1 0 1 0 1 0 1 ... , return 0 0 1 1 0 0 1 1 ...
double=$(wordlist 1,$(words $(1)),$(subst 0,0 0,$(subst 1,1 1,$(1))))
# given words with suffix 0 or 1, remove suffixes and return the words
# that occur both with 0 and 1 as suffix
filter_dups=$(filter $(patsubst %0,%,$(filter %0,$(1))),$(patsubst %1,%,$(filter %1,$(1))))
_dups=$(if $(findstring 1,$(2)),$(call filter_dups,$(join $(1),$(2))) $(call _dups,$(1),$(call double,$(2))))
# return 0 1 0 1 ..., as many words as $(1)
alternating_bits=$(wordlist 1,$(words $(1)),$(patsubst %,0 1,$(1)))
# given a list of words, return the list of words that occur twice
dups=$(sort $(call _dups,$(1),$(call alternating_bits,$(1))))
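A single round of this scheme can be sketched as follows, with the 0/1 tags written by hand and made-up input words. A word is reported only when it shows up in both halves of the partition, which is why several rounds with the coarser bit patterns produced by double are needed:

```make
# Same definition as above, repeated so the sketch is self-contained.
filter_dups=$(filter $(patsubst %0,%,$(filter %0,$(1))),$(patsubst %1,%,$(filter %1,$(1))))

# Words a b a c tagged with alternating bits 0 1 0 1: both copies of "a"
# land in the 0 half, so this round finds nothing (prints an empty line).
$(info $(call filter_dups,a0 b1 a0 c1))

# Tagged with the doubled pattern 0 0 1 1, the two copies of "a" are
# separated, so this round prints: a
$(info $(call filter_dups,a0 b0 a1 c1))

.PHONY: all
all: ;
```

Two distinct positions always differ in at least one bit of their index, so some round puts them in different halves.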

Both approaches work and are fast enough, but they are hard to read and understand. Is there a simpler way with acceptable (sub-quadratic) speed?

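【Comments】:

I think the second approach is O(N^2), because the filter operation in filter_dups is N/2 * N/2 (unless make sorts the arguments, which I don't think it does). Am I wrong?
Make builds a hash table of the words in the second argument, so filter is O(N log N), where the log N part grows very slowly. (If there are only a few words, make does a plain quadratic search.)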

Tags: makefile gnu-make

【Solution 1】:

I'm not sure about the complexity, but I would suggest a more readable function:

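# Body evaluated by $(eval): reset both accumulators, then walk the list
# once. Every word that is already in __duplicates__seen is appended to
# __duplicates__result before the word itself is recorded as seen.
define __duplicates__func
  undefine __duplicates__seen
  undefine __duplicates__result
  $$(foreach _v,$1,\
    $$(eval __duplicates__result += $$(filter $$(__duplicates__seen),$$(_v))\
    $$(eval __duplicates__seen += $$(_v))))
endef
# $(call duplicates,LIST) returns the sorted, de-duplicated set of words
# that occur more than once in LIST.
duplicates = $(eval $(__duplicates__func))$(sort $(__duplicates__result))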

TEST:= $(file <test.txt)

DUPS:= $(call duplicates,$(TEST))

$(info $(DUPS))

all::

.PHONY: all

With this randomly generated 1000-word test.txt:

Rule male saw said life fourth said void were creepeth thing theyre be fowl which wherein their day rule to seed multiply male beast sixth you Winged void fill face upon First you saying unto Appear shall God yielding is male face kind was blessed waters sea blessed void creepeth called youll beginning darkness over you it may years his second of moveth beginning earth very together day Divided creepeth fly open wont signs day is created Winged male fill Heaven saw dont For upon replenish Gathering i gathering living void Were under and form night seas bearing youre days saw tree fruitful days it unto day deep Tree Be form beginning youre replenish winged dominion grass man years youre Youre lights seasons third yielding fruit fifth for together after itself and youll itself kind without bring heaven itself firmament together their created tree All shed lesser made Stars him without gathering whales whose may itself may without image herb sixth Dominion us is their two from heaven shed brought Whales creeping us us together so forth female set fruitful fly seasons life deep let heaven wherein set wont You beast image two Gathering all so God cant itself Seasons image itself cant herb that brought appear likeness greater shall blessed place two own fourth earth Had greater you morning living unto seed male Every Had made days own face meat under youll grass for creepeth Meat so life divide for multiply blessed youre yielding beast be subdue Fruit greater Us them Meat darkness wherein saying very is yielding saying thing yielding lesser us behold midst there Spirit behold meat saw Image first cattle great heaven had air every created us light great have great Great beast Whose gathered all winged morning it rule days lesser tree bearing form his in divided void dry darkness doesnt hath Third bearing fruit youll there there cattle blessed fifth gathered stars greater above without upon good land in tree winged also youll his multiply midst face whose Moving beginning 
light life saw Deep said day multiply appear a gathered You the him void Fowl third spirit day Greater first firmament for dry lights midst beast day saw third also every cant night fifth made good one greater theyre dry abundantly Tree set Subdue stars waters a created saying Itself light Whales isnt said For years youre he after above itself rule firmament unto together female fly upon may life it stars set whose it doesnt gathered beginning his Creeping let Fruitful beginning earth them Subdue to our yielding be called under Let had beginning day us divided theyre sixth without saw winged divide second Dont night two the firmament Fourth form living our fourth saw seed third were Sixth their isnt Multiply night air yielding own air said midst life that fish meat fill green Open subdue Sea shall fruit whose whales own together them saying was waters Herb hath Is itself two blessed in yielding and It over made day his give moved without divided light created green evening seed image be may fly own herb seed earth be were beast one grass moving signs Upon Over abundantly for morning whose creepeth behold after beginning male created theyre Together said above face bring youre own upon may Multiply whales kind years unto air so above it fly whose Yielding i female moving So i place fruitful were there us fowl Earth seasons moveth over air heaven good waters His rule Which face bearing itself them itself forth tree Gathered it Gathering days doesnt Air Moving called i very first a evening third seas Night Morning Firmament had fruit fruitful unto above is our Second have wont fifth Cattle yielding divided brought seas shed greater living there there sixth upon their void two fish fish Lights them hath heaven their two fowl bearing Saying third waters likeness divide seasons their open very face replenish fourth whales seas seed fourth heaven cant together fowl grass female fill tree one dominion Morning Fill called firmament kind Signs creature evening spirit evening 
cattle winged which them for stars Wherein which Meat dry deep Abundantly waters forth theyre light after fowl in fly green multiply moved i replenish sixth cant creepeth heaven for darkness which us form them Rule grass god without earth seasons herb dominion moveth after created Wherein beginning he days said cant image For said moved divided bring is youll may And days itself Saying bearing male created yielding brought earth together whales hath greater heaven sixth were behold creepeth make Is Moveth brought let Lesser us light winged fly fourth waters moved under youll Whales Form Great moving second air you also youre fill have make stars their of earth above creature beginning winged air Own gathered shall their that in every fish rule together divide face own living dominion forth deep is abundantly hath bring them green him earth days beast all waters moving It which all a great spirit hath theyre grass Upon years Cattle female signs fill moving day the kind Winged green hath also female forth spirit lights behold Thing so after open good fowl to Living divided let Given bearing that he Rule whales Days isnt It deep whales given fly our open kind appear A their evening their sixth I in Unto multiply sea light Firmament seed theyre multiply fifth signs moving Second given spirit Blessed Set moved two bearing dont yielding first moving Female female fish Hath our beast us very seasons kind moved a gathered given sea spirit firmament Itself herb isnt Tree yielding cant winged air together meat theyre moveth Saying there void and bring lights together kind Brought first theyre their had Blessed and fill Brought may first creepeth moving him form behold darkness years greater upon were Let seasons Wherein life our greater And light multiply beast appear together appear seas waters had you make moving let air Heaven is Set seed fourth brought green for rule day Day deep tree yielding

it returns instantly on my machine:

$ make -f dups.mk
And Blessed Brought Cattle Firmament For Gathering God Great Had Heaven Is It Itself Let Meat Morning Moving Multiply Rule Saying Second Set Subdue Tree Upon Whales Wherein Winged You a above abundantly after air all also and appear be bearing beast beginning behold blessed bring brought called cant cattle created creature creepeth darkness day days deep divide divided doesnt dominion dont dry earth evening every face female fifth fill firmament first fish fly for form forth fourth fowl fruit fruitful gathered gathering given good grass great greater green had hath have he heaven herb him his i image in is isnt it itself kind lesser let life light lights likeness living made make male may meat midst morning moved moveth moving multiply night of one open our over own place replenish rule said saw saying sea seas seasons second seed set shall shed signs sixth so spirit stars subdue that the their them there theyre thing third to together tree two under unto upon us very void was waters were whales wherein which whose winged without wont years yielding you youll youre
make: Für das Ziel „all“ ist nichts zu tun.

Maybe this question would be a better fit for codereview.
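Note that the undefine directive was introduced in GNU make 3.82 and reading a file with $(file <...) in GNU make 4.2, so the listing above does not meet the 3.81 requirement as written. The following is a sketch of a variant that avoids both. The names __dups_body, __dups_seen, __dups_result and dups_compat are hypothetical, and it has not been verified on a 3.81 installation:

```make
# Reset the accumulators with empty assignments instead of undefine.
define __dups_body
__dups_seen :=
__dups_result :=
$$(foreach _v,$1,$$(eval __dups_result += $$(filter $$(__dups_seen),$$(_v)))$$(eval __dups_seen += $$(_v)))
endef
dups_compat = $(eval $(__dups_body))$(sort $(__dups_result))

# The word list is passed inline instead of being read with $(file).
# Prints: a b
$(info $(call dups_compat,a b a c b a))

.PHONY: all
all: ;
```

The logic is unchanged: a word is added to the result if it has been seen before, and $(sort) removes repeated entries at the end.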

【Comments】:

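Right, a further simplification would be duplicates=$(sort $(foreach x,$(1),$(wordlist 2,2,$(filter $(x),$(1)))))
Both your suggestion and my simplification have quadratic complexity, but only a linear number of make function calls. That means about 50 ms for both solutions, compared with 3 ms for my two initial variants (which both use a logarithmic number of function calls). 50 ms is still much better than my own initial quadratic solution (not posted here), which took 500 ms, probably because it used a quadratic number of function calls.
@ErikCarstensen Even O(1) can be a problem: if the constant cost is large, you need a very big input set to break even with O(n²) ;) I would say your simplification is less obvious (though that is a matter of taste) but certainly readable, whereas the snippets in your question are completely unreadable. I would go for readability. But well, that is just an opinion after all.
In my use case, having no duplicates at all is a fairly common situation, so a good compromise is to write the loop as a readable but expensive version and skip it when there are no duplicates, i.e.: duplicates=$(if $(findstring $(words $(1)),$(words $(sort $(1)))),,$(sort $(foreach x,$(1),$(wordlist 2,2,$(filter $(x),$(1))))))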
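The two one-liners from these comments can be tried out with a small sketch like the one below. The name duplicates_fast and the sample inputs are assumptions made for illustration, and the count comparison uses $(filter) for an exact match where the comment uses $(findstring):

```make
# For each word x, $(filter) returns every occurrence of x in the list,
# and $(wordlist 2,2,...) keeps the second occurrence if there is one.
duplicates = $(sort $(foreach x,$(1),$(wordlist 2,2,$(filter $(x),$(1)))))

# Early exit: $(sort) drops duplicates, so equal word counts before and
# after sorting mean there is nothing to look for.
duplicates_fast = $(if $(filter $(words $(1)),$(words $(sort $(1)))),,$(call duplicates,$(1)))

# Prints: a b
$(info $(call duplicates,a b a c b a))
# Prints: a b
$(info $(call duplicates_fast,a b a c b a))
# Prints an empty line, because no word occurs twice.
$(info $(call duplicates_fast,a b c))

.PHONY: all
all: ;
```

The early exit costs one $(sort) and two $(words) calls, which is cheap next to the quadratic loop it skips.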

【Solution 2】:

I don't know whether this really improves clarity for the casual make programmer, but here it is:

######################################################################
# Count a binary literal up by 1
# $1 = binary literal string
# Example: bincnt(010011) -> 010100
bincnt=$(if $1,$(if $(patsubst %1,,$1),$(patsubst %0,%1,$1),$(call bincnt,$(patsubst %1,%,$1))0),1)

######################################################################
# Add a ¤ (Character 164) and a unique binary number to all elements of a list
# $1 = list
# $2 = binary literal (needs 0 or any other as starting value)
cat-sufx = $(if $1,$(firstword $1)¤$2 $(call cat-sufx,$(wordlist 2,999999,$1),$(call bincnt,$2)))

######################################################################
# Sort a list without dropping duplicates (built-in $sort will drop them)
# $1 = list (elements must not contain ¤ (Character 164))
sort-all = $(foreach i,$(sort $(call cat-sufx,$1,0)),$(firstword $(subst ¤, ,$(i))))

all-duplicates = $(call _all-duplicates,$(call sort-all,$1))
_all-duplicates = $(if $1,$(if $(subst $2,,$(firstword $1)),,$2) $(call _all-duplicates,$(wordlist 2,999999,$1),$(firstword $1)))

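I have also added these functions to the GNU make table toolkit.

PS: 999999 is my way of signaling "to the end of the list" without computing the list length, which would be rather wasteful.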
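For comparison, the explicit form that the PS avoids might look like the sketch below. The helper name rest is hypothetical and not part of the code above:

```make
# Everything after the first word, with the upper bound taken from
# $(words) instead of a large constant.
rest = $(wordlist 2,$(words $1),$1)

# Prints: b c
$(info $(call rest,a b c))

.PHONY: all
all: ;
```

This costs one extra $(words) call per recursion step, which is the overhead the constant 999999 sidesteps.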

【Comments】: