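【Posted】: 2016-05-09 05:49:57
【Question】:
I built a web scraper that scrapes data from a website and stores it in a CSV file. The problem is that one column on the site holds currency-formatted values such as 7,100 or 85,210. When my code runs and scrapes the data, each of these values is split across two columns, for example 7 in one column and 100 in the next. Please see the attached screenshot. The code is below.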
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ComMarket_summary {
    boolean writeCSVToConsole = true;
    boolean writeCSVToFile = true;
    boolean sortTheList = true;
    boolean writeToConsole;
    boolean writeToFile;
    public static Document doc = null;
    public static Elements tbodyElements = null;
    public static Elements elements = null;
    public static Elements tdElements = null;
    public static Elements trElement2 = null;
    public static String Dcomma = ",";
    public static String line = "";
    public static ArrayList<Elements> sampleList = new ArrayList<Elements>();

    // Fetch and parse the market summary page through the proxy.
    public static void createConnection() throws IOException {
        System.setProperty("http.proxyHost", "191.1.1.202");
        System.setProperty("http.proxyPort", "8080");
        String tempUrl = "http://www.psx.com.pk/phps/mktSummary.php";
        doc = Jsoup.parse(new URL(tempUrl), 1000);
        System.out.println("Successfully Connected");
    }

    // Walk the rows of the third "marketData" table and append each cell to the CSV file.
    public static void parsingHTML() throws Exception {
        for (Element table : doc.select("table.marketData")) {
            Elements tables = doc.select("table.marketData");
            table = tables.get(2);
            File fold = new File("C:\\market_smry.csv");
            fold.delete();
            File fnew = new File("C:\\market_smry.csv");
            for (Element trElement : table.getElementsByTag("tr")) {
                trElement2 = trElement.getElementsByTag("tr");
                tdElements = trElement.getElementsByTag("td");
                FileWriter sb = new FileWriter(fnew, true);
                //if (table.hasClass("marketData")) { //&&(tdElements.hasClass("tableHead")&&tdElements.hasClass("tableSubHead"))
                for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
                    if (it.hasNext()) {
                        sb.append(" , ");
                        sb.append(" \r\n ");
                    }
                    for (Iterator<Element> it2 = tdElements.iterator(); it.hasNext();) {
                        Element tdElement2 = it.next();
                        // The cell text is written unmodified, followed by a comma delimiter.
                        final String content = tdElement2.text();
                        if (it2.hasNext()) {
                            sb.append(formatData(content));
                            sb.append(" , ");
                        }
                    }
                    System.out.println(sb.toString());
                    sb.flush();
                    sb.close();
                }
                System.out.println(sampleList.add(tdElements));
            }
        }
    }

    private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
    private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("d-MMM-yy", Locale.US);

    // Reformat text that parses as a "MMM d, yyyy" date; return any other text unchanged.
    public static String formatData(String text) {
        String tmp = null;
        try {
            Date d = FORMATTER_MMM_d_yyyy.parse(text);
            tmp = FORMATTER_dd_MMM_yyyy.format(d);
        } catch (ParseException pe) {
            tmp = text;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException, Exception {
        createConnection();
        parsingHTML();
    }
}
Note: I am using Windows 8, Java 1.8, and jsoup 1.8.
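To illustrate the symptom described above: a cell whose text is 7,100 contains the same character the file uses as its delimiter, so a CSV reader sees two fields unless the value is wrapped in double quotes. The following is only a minimal sketch of that quoting convention (RFC 4180 style), not the scraper's actual code. The class `CsvFields`, its methods `quote` and `toCsvLine`, and the sample cell "ABC" are hypothetical names and data introduced for illustration; the values 7,100 and 85,210 come from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the original scraper.
final class CsvFields {

    // Wrap a field in double quotes when it contains a comma, a double quote,
    // or a line break; embedded double quotes are doubled (RFC 4180 style).
    static String quote(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join already-extracted cell texts into one CSV record without a trailing delimiter.
    static String toCsvLine(List<String> cells) {
        List<String> quoted = new ArrayList<>();
        for (String cell : cells) {
            quoted.add(quote(cell.trim()));
        }
        return String.join(",", quoted);
    }

    public static void main(String[] args) {
        // "ABC" is made-up sample data; the two amounts are the ones from the question.
        List<String> row = Arrays.asList("ABC", "7,100", "85,210");
        System.out.println(toCsvLine(row)); // prints ABC,"7,100","85,210"
    }
}
```

In the scraper, the text returned by `tdElement2.text()` could be passed through a helper like this before it is appended to the `FileWriter`. Another option, if the amounts should end up as plain numbers, is to strip the thousands separator instead of quoting the field.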
【Discussion】:
Tags: java web-scraping jsoup