【Question title】: How to scrape currency-formatted numbers into a CSV file with Java
【Posted】: 2016-05-09 05:49:57
【Question】:

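I built a web scraper that pulls data from a website and stores it in a CSV file. The problem is that one column on the site holds currency-formatted values such as 7,100 or 85,210. When my code runs and scrapes the data, each of these values is split across two columns, for example 7 in one column and 100 in the next. Please see the attached screenshot. The code is below.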

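import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ComMarket_summary {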

boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ",";
public static String line = "";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();

public static void createConnection() throws IOException {
    System.setProperty("http.proxyHost", "191.1.1.202");
    System.setProperty("http.proxyPort", "8080");
    String tempUrl = "http://www.psx.com.pk/phps/mktSummary.php";
    doc = Jsoup.parse(new URL(tempUrl), 1000);
    System.out.println("Successfully Connected");
}

public static void parsingHTML() throws Exception {

    for (Element table : doc.select("table.marketData")) {
        Elements tables = doc.select("table.marketData");
        table = tables.get(2);
        File fold = new File("C:\\market_smry.csv");
        fold.delete();
        File fnew = new File("C:\\market_smry.csv");
        for (Element trElement : table.getElementsByTag("tr")) {

            trElement2 = trElement.getElementsByTag("tr");
            tdElements = trElement.getElementsByTag("td");
            FileWriter sb = new FileWriter(fnew, true);

            //if (table.hasClass("marketData")) { //&&(tdElements.hasClass("tableHead")&&tdElements.hasClass("tableSubHead"))
            for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
                if (it.hasNext()) {
                    sb.append(" , ");
                    sb.append(" \r\n ");
                }

                for (Iterator<Element> it2 = tdElements.iterator(); it.hasNext();) {
                    Element tdElement2 = it.next();
                    final String content = tdElement2.text();
                    if (it2.hasNext()) {

                        sb.append(formatData(content));
                        sb.append("   ,   ");
                        

                    }
                }

                System.out.println(sb.toString());
                sb.flush();
                sb.close();
            }

            System.out.println(sampleList.add(tdElements));

        }
    }
}
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("d-MMM-yy", Locale.US);

public static String formatData(String text) {
    String tmp = null;

    try {
        Date d = FORMATTER_MMM_d_yyyy.parse(text);
        tmp = FORMATTER_dd_MMM_yyyy.format(d);
    } catch (ParseException pe) {
        tmp = text;
    }

    return tmp;
}

public static void main(String[] args) throws IOException, Exception {
    createConnection();
    parsingHTML();

}

} // end of class ComMarket_summary

Note: I am using Windows 8, Java 1.8, and jsoup 1.8.

【Comments】:

    Tags: java web-scraping jsoup


    【Solution 1】:

    Use String.replace to strip the commas before saving the value:

    value = value.replace (",", "");
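As a rough illustration of where that call could sit, here is a minimal sketch that cleans every cell before the row is joined with commas. The class `CsvRowSketch` and its methods are hypothetical helpers, not part of the asker's scraper, and the sample strings stand in for the results of `tdElement.text()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original scraper.
class CsvRowSketch {

    // Removes thousands separators so "7,100" becomes "7100".
    static String stripCommas(String cellText) {
        return cellText.replace(",", "");
    }

    // Trims each cell, strips its commas, and joins the cells into one CSV line.
    static String toCsvRow(List<String> cellTexts) {
        List<String> cleaned = new ArrayList<>();
        for (String cellText : cellTexts) {
            cleaned.add(stripCommas(cellText.trim()));
        }
        return String.join(",", cleaned);
    }

    public static void main(String[] args) {
        // Placeholder values standing in for tdElement.text() results.
        List<String> cells = Arrays.asList("SYMBOL", "7,100", "85,210");
        System.out.println(toCsvRow(cells)); // prints SYMBOL,7100,85210
    }
}
```

Because the only commas left in the line are the ones added by `String.join`, each amount stays in a single column when the file is opened.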
    

    【Comments】:

    • Can replace() be used in HTML functions?
    • I mean the text 7,100 is the text of an HTML td element. So can I apply the replacement to the tdElement text, like final String content = tdElement2.text(); content = content.replace(","," ")?
    • String content = tdElement2.text().replace(",", "");
    【Solution 2】:

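    String.replace will strip your commas. There are a few similar methods (replaceAll, replaceFirst), but both of those treat their first argument as a regular expression, whereas replace matches it literally. replace is slightly faster and is usually the best choice for a single character.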

    See: https://docs.oracle.com/javase/6/docs/api/java/lang/String.html

    Also: Difference between String replace() and replaceAll()
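To make the difference between the three methods concrete, here is a small sketch. The class name `ReplaceVariantsSketch` and the sample value 1,234,567 are illustrative assumptions, not taken from the question.

```java
// Hypothetical demo class comparing the three String methods.
class ReplaceVariantsSketch {

    // Literal match, replaces every occurrence.
    static String literalAll(String s) {
        return s.replace(",", "");
    }

    // Regex match, replaces every occurrence.
    static String regexAll(String s) {
        return s.replaceAll(",", "");
    }

    // Regex match, replaces only the first occurrence.
    static String regexFirst(String s) {
        return s.replaceFirst(",", "");
    }

    public static void main(String[] args) {
        String amount = "1,234,567"; // illustrative value
        System.out.println(literalAll(amount)); // prints 1234567
        System.out.println(regexAll(amount));   // prints 1234567
        System.out.println(regexFirst(amount)); // prints 1234,567

        // The literal/regex distinction matters for characters like '.':
        System.out.println("1.5".replace(".", ""));              // prints 15
        System.out.println("1.5".replaceAll(".", "").isEmpty()); // prints true, "." matched every character
    }
}
```

For a comma the two "replace all" variants give the same result, so the choice mostly comes down to avoiding accidental regex behavior and the small cost of compiling a pattern.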

    【Comments】:
