How to generate a dummy variable in Stata based on a substring of an existing string variable?

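【Question title】: How to generate a dummy variable in Stata based on a substring of an existing string variable?
【Posted】: 2020-12-17 19:55:16
【Question】:

I am looking for a way to create a dummy variable that checks a string variable called text against several given substrings, such as "book", "buy", and "journey".

That is, I want to check whether an observation contains book, buy, or journey. If one of these keywords is found anywhere in the text, the dummy variable should be 1, otherwise 0. An example: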

                 TEXT
Book your tickets now
Swiss is making your journey easy
Buy your holiday tickets now!
A touch of Austria in your lungs.

The desired result would be

dummy variable
       1
       1
       1
       0

I tried this with strpos and regexm, but with very limited success.

Regards,

Josh

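【Comments】:

Tags: variables stata

【Solution 1】:

Using strpos can be tedious because you have to account for upper and lower case in every keyword, so I would use a regular expression instead.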

* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 text
"Book your tickets now"            
"Swiss is making your journey easy"
"Buy your holiday tickets now!"    
"A touch of Austria in your lungs."
end

generate wanted = regexm(text, "[Bb]ook|[Bb]uy|[Jj]ourney")
list

Result:

. list

     +--------------------------------------------+
     |                              text   wanted |
     |--------------------------------------------|
  1. |             Book your tickets now        1 |
  2. | Swiss is making your journey easy        1 |
  3. |     Buy your holiday tickets now!        1 |
  4. | A touch of Austria in your lungs.        0 |
     +--------------------------------------------+

For more information on regular expressions, see also this link.
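
As a complement to the answer above, here is a minimal sketch of three other ways to handle case, assuming the example data with the string variable text is already in memory. The variable names wanted_strpos, wanted_icase, and wanted_word are illustrative and not part of the original answer. The ustrregexm() function requires Stata 14 or later, and the `///` line continuation only works in a do-file, not at the interactive prompt.

```stata
* Sketch only: alternatives to the case-bracketed regexm() pattern.
* Assumes the variable text from the example above is in memory.
* The new variable names below are illustrative.

* 1) strpos() on a lowercased copy of text.
*    strpos() returns 0 when the substring is not found.
generate byte wanted_strpos = strpos(lower(text), "book") > 0 | ///
    strpos(lower(text), "buy") > 0 | ///
    strpos(lower(text), "journey") > 0

* 2) ustrregexm() with a nonzero third argument: case-insensitive match.
generate byte wanted_icase = ustrregexm(text, "book|buy|journey", 1)

* 3) Whole words only: \b marks a word boundary, so "Facebook" or
*    "booking" would no longer count as a match for "book".
generate byte wanted_word = ustrregexm(text, "\b(book|buy|journey)\b", 1)

list
```

On the four example observations, all three variables should agree with wanted. The choice between the second and third form depends on whether a keyword embedded in a longer word should count as a match.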

【Comments】: