Sunday, 30 November 2014

What you have to know before requesting web scraping services?

Before you request web scraping services you have to know what are your needs (what data you need, structure of it and where you can find this data).

Step 1: Define what data you need?

Data needs depending on purpose, if you want to find new customers you probably need contact data from players in your industry. Also if you want to study your competitors you need to define who are they. Only after that you can select data sources (websites feeds or other electronic sources) for this extraction.

In many cases for discovering and defining data sources are used search engines like Google, Bing, Yahoo, and others.

Step 2: Structure of data

Data structure it’s directly linked to usage purpose. In many cases data structure it’s a table where a row represents an entity and a cell of this row represents a property of this entity. In other cases Data structure is a a chart or another graphic representation builder with data extracted from a web source.

Step 3: Number of data extraction

In many cases is needed one time data extraction. In other cases when you need a regular report, are needed periodically extractions.

If you have defined all of above points you are ready to request a quote and an amount estimation from this contact form.

Source: http://thewebminer.com/blog/2013/08/

Wednesday, 26 November 2014

Data Mining KNN Classifier

Q1   

Suppose a data analyst working for an insurance company was asked to build a predictive model for predicting weather a customer will buy a mobile home insurance policy. S/he tried kNN classifier with different number of neighbours (k=1,2,3,4,5). S/he got the following F-scores measured on the training data: (1.0; 0.92; 0.90; 0.85; 0.82). Based on that the analyst decided to deploy kNN with k=1. Was it a good choice? How would you select an optimal number of neighbours in this case?

1 Answer

It is not a good idea to select a parameter of a prediction algorithm using the whole training set as the result will be biased towards this particular training set and has no information about generalization performance (i.e. performance towards unseen cases). You should apply a cross-validation technique e.g. 10-fold cross-validation to select the best K (i.e. K with largest F-value) within a range. This involves splitting your training data in 10 equal parts retain 9 parts for training and 1 for validation. Iterate such that each part has been left out for validation. If you take enough folds this will allow you as well to obtain statistics of the F-value and then you can test whether these values for different K values are statistically significant.

See e.g. also: http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Falg_knn_training_crossvalidation.htm

The subtlety here however is that there is likely a dependency between the number of data points for prediction and the K-value. So If you apply cross-validation you use 9/10 of the training set for training...Not sure whether any research has been performed on this and how to correct for that in the final training set. Anyway most software packages just use the abovementioned techniques e.g. see SPSS in the link. A solution is to use leave-one-out cross-validation (each data samples is left out once for testing) in that case you have N-1 training samples(the original training set has N).

Source:http://stackoverflow.com/questions/21121509/data-mining-knn-classifier?rq=1

Wednesday, 19 November 2014

Is It Time to End Screen Scraping?

As the industry works to improve the way online banking information is shared with personal financial management apps, a debate is brewing over whether to end the decades-old practice of screen scraping.

Proponents of the popular method say it is a valuable supplement to direct data feeds that may be incomplete or out-of-date. But screen scraping also raises risk concerns, since like other data collection methods it requires consumers to cough up their banking credentials.

"I have not talked to a bank that hasn't confirmed it's a growing problem in their organization," said Jim Routh, the chairman of the products and services committee at Financial Services Information Sharing and Analysis Center.

Financial institutions worry that data aggregators may not take all the appropriate security precautions. According to the FS-ISAC, an industry organization, startups are entering the aggregation market without making security a higher priority.

Routh, who is Aetna's chief information security officer and a former global head of application and mobile security for JPMorgan Chase, said the upstarts do some things well, but "protecting credentials isn't necessarily high on their priorities." The problem is worsened by data aggregators that collect marketing data, such as the device a consumer is using, to understand their behaviors across channels, he said.

The FS-ISAC has proposed creating a standard application programming interface to share information from bank accounts. The API would serve as the conduit for data when consumers wish to use a web or mobile app to receive push bill reminders, to verify their bank accounts or for numerous other PFM use cases.

The proposed API would also be designed to reduce the storage of financial data. But if the industry embraces the model, it would be harder for aggregators to do screen-scraping.

For years, PFM companies have used this tool to obtain customers' banking account information. With consumers' permission, aggregators log in with the customer's user name and password to grab financial data and use it to populate the mobile or web app of the customer's choice — whether or not the bank supports the technique.

Yodlee, which works with more than 300 banks as well as startups, argues that there is a place and a need for aggregators to collect data through various techniques to provide the best customer experience.

Brian Costello, vice president of operations and security at Yodlee, said his company uses a combination of methods to gather customer account data. If it couldn't get data from a direct feed, it could also screen scrape.

If the industry moved to embracing only one data exchange method, Yodlee could be more vulnerable to the problem of receiving outdated information from the banks.

When a bank changes an annual percentage rate, if it doesn't update the data feed it sends to the aggregator right away, the PFM services that rely on that data will appear stale. (Services like Credit Karma, Mint and Wallaby, for example, rely on aggregation technology to recommend financial products to consumers according to price, among other things.)

Proper maintenance of data feeds, of course, takes time and money — resources many banks are short on. But delays could also result from the bankers' dilemma: On the one hand, they want to let customers aggregate their accounts to gather intelligence on their competitors. On the other hand, they may have reservations about their rivals collecting that same data in the battle for wallet share.

"Banks are under tremendous pressure to retain and obtain more clients," said Costello.

Screen scraping also has maintenance requirements, though. The FS-ISAC white paper draft said the approach "requires some coordination from the FI to allow what appears to be an automated attack against their application. To avoid blocking the aggregator's attempt to screen scrape the financial institution's application with this or other current security controls, a whitelist of aggregator IPs are set up and maintained by the FIs."

Like Costello, Marc West, president of digital channels at Fiserv, said a combination of data collection methods is better than a standard data exchange approach that might fail to extract the necessary information. Any data feed, said West, offers a limited set of data and information, while a scrape can enable a custom data extract.

But Aetna's Routh said moving to a real-time API model would improve a recurring issue caused by screen scraping: customer service hiccups. A consumer may call the company behind the personal financial app when a link to an account is broken. The PFM provider might tell him to call the bank, when the problem could lie with the aggregator not knowing of an update to the bank's code.

"The consumer gets in the middle of a customer service issue that is thorny at best and unsolvable at worst," Routh said. "Unfortunately that happens more frequently than anyone would like to it happen.

The new model, then, is "inevitable" in Routh's point of view because of the risk and economics involved. "This won't happen overnight," he said. "It needs some legs."

Kristin Moyer, a research vice president in industry advisory services and banking and investment services at Gartner, said she expects more banks to embrace APIs as a way to compete in a digital world.

Already financial institutions like Capital One, Agricole Bank and Fidor Bank are piloting and testing the OAuth specification, which lets banks keep ownership of the customer log-in data but requires them to make available an API. (The FS-ISAC is also promoting OAuth 2.0 as a way to strengthen aggregation security.)

"It's something we will see a lot more of in the next two to three years," said Moyer. "It's an exciting time…I think the use of APIs will enable us as an industry [to do things] that we never really imagined possible before."

LESSONS ABROAD

The move away from screen scraping has already happened in some countries that lack a data exchange standard. Regulators in Poland, for example, recently recommended the practice halt. Responding to the guidance, mBank is one of the banks that changed its aggregation roadmap.

The bank, which spun off from BRE Bank, had been piloting a PFM service with friends and family and has now suspended the pilot. It had, however, already made use of aggregation technology so consumers, who weren't customers of the bank, could get loan decisions from mBank within half an episode of "Modern Family." Indeed, the bank would screen scrape consumers' external bank accounts to make a loan decision within five to 15 minutes. Now, loan decisions have to be made at a branch or for a smaller dollar amount after a consumer sends the bank a copy of an electronic statement.

"Right now we have to put it on the shelf. We haven't killed it. We want to resurrect it," said Michal Panowicz, senior director at mBank.

Overall, he sounds calm about the setback. "This is a regulator decision," said Panowicz. "We have to respect that. …We have to live with them on good footing."

But that doesn't mean it has given up on aggregation. Payday lenders can continue to screen scrape financial data in order to make loan decisions in Poland — which makes it an uneven playing field.

"We will try to convey the logic that [screen scraping] cannot be stopped," said Panowicz.

He views it as a longer term game for something he believes is valuable to consumers. mBank like other banks wants to realize the true aggregation dream: letting customers quickly switch bank accounts and products if they wish.

"To be honest, it's the most exciting part about aggregation... to move accounts to us without spending a minute of physical labor," he said.

Source:http://www.americanbanker.com/news/technology/is-it-time-to-end-screen-scraping-1071118-1.html

Saturday, 15 November 2014

Screenscraping from Java using jsoup – effective data gathering from websites

In a recent article I discussed screenscraping in a in hindsight fairly clumsy way (http://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I had need of statistics regarding the Olympic Games – more specifically: the overall medal count per country during the 2008 Bejing Olympic Games. This information is readily available from dozens of websites. However, I could not find one hat offered the data in easy to process XML or CSV format – all websites had human consumers in mind.

Using screenscraping – we use a programmatic facility to consume the content that is intended to be displayed on screen to human users and subsequently process that content by extracting the required data from it. Some web-pages are easier to scrape than others – this depends on the richness of the HTML (the poorer the better for scraping), the required interactivity (JavaScript, AJAX – the less the better) and the structure used to present the data (tables, frequently despised by web developers, work rather well).

I came across a tool for screenscraping from Java, called jsoup – http://jsoup.org/. It turned out to be so incredibly easy to use – that I thouht I should share it.

Getting going with jsoup is as easy as can be:

1. download jsoup-1.6.1.jar (or whatever the latest version is) from http://jsoup.org/download

2. add this jar as a dependency in your project and/or application CLASSPATH

3. make use of jsoup in the code that does the screenscraping.

A simple example of code that uses jsoup (more examples on: http://jsoup.org/cookbook/):

One of the websites offering the overall medal count is http://www.databaseolympics.com/games/gamesyear.htm?g=26. The page looks as follows:

Image

Well, more importantly, the page looks like this:

Image

This means in terms of screenscraping: I will find the medal count for each country inside a TABLE element with styleclass pt8. Each country has a TR element. Only the first TR element does not represent a country score, as it is the table header. The first TD element in the TR represents the country. The name of the country can be retrieved as the text content from the A element in the TD. The next TD elements contain the numbers of medals in Gold, Silver, Bronze and Total.

The corresponding Java code with jsoup boils down to:

public static void main(String[] args) throws IOException, SQLException, InterruptedException {

        Document doc = Jsoup.connect(OlympicMedalMirrorProcessor.baseUrl + "?g=26").get();
        String title = doc.title();
        System.out.println(title);
        Element table = doc.select("table.pt8").get(0);
        Elements trs = table.select("tr");
        Iterator trIter = trs.iterator();
        boolean firstRow = true;
        while (trIter.hasNext()) {


            Element tr = (Element)trIter.next();
            if (firstRow) {
                firstRow = false;
                continue;
            }
            Elements tds = tr.select("td");
            Iterator tdIter = tds.iterator();
            int tdCount = 1;
            String country = null;
            Integer gold = null;
            Integer silver = null;
            Integer bronze = null;
            Integer total = null;
            // process new line
            while (tdIter.hasNext()) {

                Element td = (Element)tdIter.next();
                switch (tdCount++) {
                case 1:
                    country = td.select("a").text();
                    break;
                case 2:
                    gold = Integer.parseInt(td.text());
                    break;
                case 3:
                    silver = Integer.parseInt(td.text());
                    break;
                case 4:
                    bronze = Integer.parseInt(td.text());
                    break;
                case 5:
                    total = Integer.parseInt(td.text());
                    break;
                }

            }
            System.out.println(country + ": gold " + gold + " silver " + silver + " bronze " + bronze + " total " +
                               total);
        } //table rows

Source:http://technology.amis.nl/2011/08/03/screenscraping-from-java-using-jsoup-effective-data-gathering-from-websites/

Monday, 10 November 2014

Review: import.io’s New Scraping Process and Features

Web scraping Data platform import.io, announced last week that they have secured $3M in funding from investors that include the founders of Yahoo! and MySQL.

They also released a new beta version of the tool that is essentially a better version of their extraction tool, with some new features and a much cleaner and faster user experience.

First Impression

I’ve used the tool for a week and can say it is an improvement over the old version – which was a bit bulky and awkward. While still not exactly the most intuitive process, the development team at import.io has managed to slim down what was a relatively button heavy process, without sacrificing any of the functionality – they made the new workflow both simpler and more complicated at the same.

The new version features a simple tool bar across the top as opposed to the space hogging table and wizard from before, which is a large improvement on the pink and white of the previous version.

True, the loss of the wizard means there isn’t as much guidance as before (the pop-up help only appears on the first use), but the undo button means you don’t really need it. You can click around and experiment a bit with the different extraction options before settling down to do some real work.

Data Extraction

Once you’ve figured out how it works, the new version requires far fewer mouse clicks to get from the page to a table of data/API as shown in their homepage video.

All you need to do now is navigate to a website, click a single piece of data on the page – such as price, image, or URL – and their app will find all the other examples of similar data on the website, immediately creating a structured table of data.

download2

This latest version of the extractor also includes a important new feature labeled “Suggest Data”. Its important because it lets you extract all the data from a page, instantly creating a table of data that can be published as an API. This makes import.io very exciting and quick, I spent a long time playing with this and it worked on the majority of sites.

Advanced Features

Most non-programmer web scrapers struggle with complex sites that use JavaScript or iFrames, but import.io also now deals with this. In the basic mode you can toggle JavaScript and CSS on and off to help you see your data better.

If that doesn’t work, you can switch into an ‘advanced mode’ where import allows you to write your own XPath and RegExp. They’ve also added a source code view, though without the ability to click on the site and inspect element (like in Chrome) this feature isn’t particularly useful.

API Integration

Once you’ve created your scraper, there are a number of options for what you can do with it.

If you’ want you can just copy and paste the data into a spreadsheet or Download as CSV. You can also push your data directly Google Sheets, with import.io’s self generated formula.

For the rest of us, they have surfaced both the POST and GET requests for you and given you a JSON view which allows you to see how the data is returned, which is handy.

All this functionality is nice, and it’s clear they’re trying to cater to all technical levels, but it has made the API page somewhat messy and potentially confusing for newer or less technical users, but they should be able to get what they need.

Good with lots of Potential

Their new tool certainly isn’t perfect. There are still a few sites where manual row training is required and you can’t access the authentication feature (though you can still do this in the old version) or pagination.

Even if it’s not quite there yet, if import.io continue like this, they are well on its way to becoming the best data scraping platform on the market. Especially when you consider the “free for life” price tag.

Source:http://scraping.pro/review-import-ios-new-scraping-tools-features/

Wednesday, 5 November 2014

Why People Hesitate To Try Data Mining

What is hindering a number of people from venturing into the promising world of data mining? Despite so much encouragement, promotions, testimonials, and evidences of the benefits of online data collection, still only a handful take the challenge and really gain the pay offs it has to offer.

It may sound unthinkable that such an opportunity for success has been neglected by many. It may also sound absurd why many well-meaning individuals are hindered from enjoying the benefits of the blessings of the 21st century.

The Causes

After considerable observation and analysis of the human psyche, one can understand the underlying reasons behind the hesitance to try the profitable data mining service. The most common reasons why people are afraid to try new technology or why they remain passive and uninvolved are: fear; lack of knowledge; and pride.

Fear. The most paralyzing of human emotions is fear. It can, to some extent, cause a person to be insane, unprofitable, sick, and lost. Although fear is a normal reaction to certain stimuli and a natural feeling experienced by humans, it must always be monitored and controlled.  Usually, people share common fears, such as: fear of change; fear of anything new; and fear of the unknown.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/people-hesitate-try-data-mining/