【发布时间】:2015-07-08 11:17:26
【问题描述】:
我正在尝试从具有以下格式的 word 文档中提取文本并将数据插入 SQL 数据库。
Word 文档
Name of House: Aasleagh Lodge
Townland: Srahatloe
Near: Killary Harbour, Leenane
Status/Public Access: maintained, private fishing lodge
Date Built: 1838-1850, burnt 1923, rebuilt 1928
源代码
var wordApp = new Microsoft.Office.Interop.Word.Application();
var wordDoc = wordApp.Documents.Open(@"C:\Users\mhoban\Documents\Book.docx");
var txt = wordDoc.Content.Text;
var regex = new Regex(@"(Name of House\: )(.+?)[\r\n]");
var allMatches = regex.Matches(txt);
foreach (Match match in allMatches)
{
var nameValue = match.Groups[2].Value;
var townValue = match.Groups[2].Value;
SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["ConnectionString"].ToString());
SqlCommand com = new SqlCommand();
com.CommandText = "INSERT INTO Houses (Name, Townland) VALUES (@name, @town)";
com.Parameters.Add("@name", SqlDbType.NVarChar).SqlValue = nameValue;
com.Parameters.Add("@town", SqlDbType.NVarChar).SqlValue = townValue;
com.Connection = con;
con.Open();
com.ExecuteNonQuery();
con.Close();
}
这很好用,唯一的问题是我将如何编写代码来插入其他文本字段,例如这一行
var regex = new Regex(@"(Name of House\: )(.+?)[\r\n]");
在这种情况下插入房子的名称“Aasleagh Lodge”,但我该如何写这行来插入城镇?
我尝试用我需要的字段名称替换正则表达式中的“Townland”,但我最终得到了单个记录,每个记录只包含一个不同的列值。
有没有办法我可以同时插入数据,可能是使用列表或其他东西,这样就不会发生这种情况。
新的源代码
var wordApp = new Microsoft.Office.Interop.Word.Application();
var wordDoc = wordApp.Documents.Open(@"C:\Users\mhoban\Documents\Book.docx");
var txt = wordDoc.Content.Text;
using (var sr = new StringReader(txt))
{
var s = string.Empty;
var nameValue = new StringBuilder();
var townValue = new StringBuilder();
while ((s = sr.ReadLine()) != null)
{
if (s.StartsWith("Name of House"))
{
nameValue.Append(s.Split(new[] { ':' })[1].Trim());
}
else if (s.StartsWith("Townland"))
{
townValue.Append(s.Split(new[] { ':' })[1].Trim());
}
if (nameValue.Length > 0 && townValue.Length > 0)
{
SqlConnection con = new SqlConnection(ConfigurationManager.ConnectionStrings["ConnectionString"].ToString());
SqlCommand com = new SqlCommand();
com.CommandText = "INSERT INTO Houses (Name, Townland) VALUES (@name, @town)";
com.CommandText = "INSERT INTO Houses (Name) VALUES (@name)";
com.Parameters.Add("@name", SqlDbType.NVarChar).SqlValue = nameValue;
com.Parameters.Add("@town", SqlDbType.NVarChar).SqlValue = townValue;
com.Connection = con;
con.Open();
com.ExecuteNonQuery();
con.Close();
nameValue.Clear(); townValue.Clear();
}
}
}
数据库字段
[Id] NCHAR (10) NULL,
[Name] NVARCHAR (MAX) NULL,
[Townland] NVARCHAR (MAX) NULL,
[Near] NVARCHAR (MAX) NULL,
[Status] NVARCHAR (MAX) NULL,
[Built] NVARCHAR (MAX) NULL,
[Description] NVARCHAR (MAX) NULL,
[Families] NVARCHAR (MAX) NULL,
[Images] IMAGE NULL
【问题讨论】:
-
我不认为它可以用正则表达式来完成。您需要使用 SingleLine 选项来获取删除返回的所有行。问题是您的值在需要作为终止符返回的单词之间有空格。不用正则表达式我也能轻松做到。
-
@jdweng 你能给我举个例子吗?
标签: c# sql asp.net regex ms-word