Extracting Google-Indexed Web Site Pages Using MS Excel

June 27th, 2009 by payam

In the field of Internet marketing research or search engine optimization, it is very important to know which pages of your web site have been indexed in Google. Currently, by default there is no way to export indexed pages in Google to a convenient spreadsheet for further analysis. If you want to learn how to do this fairly easily, however, keep reading.Popular spreadsheet software such as MS Excel is readily available to accomplish this task. However, the problem is a bit challenging: “How can you extract indexed pages of our website to MS Excel?”

First, before we can proceed with this tutorial, we will illustrate the importance of extracting indexed pages:

  • Duplicate content analysis in search engine optimization (for example, finding similar pages in your website) is only possible if you can compare one page to another by having a list of URLs in your website.
  • Spotting the important and low value URLs in your website are key factors in optimizing traffic. Without a list of these indexed URLs, it will be very hard to improve.
  • Lastly, if you are concerned with your website security, by extracting all pages of your website which have been indexed in Google, you can easily know which URLs affect your web site’s security. Examples of such URLs are admin pages, download pages and customer information pages. By nature, Google indexes URLs if your server does not restrict them and if there are links pointing to them.

Currently, Google is the most powerful search engine on the planet, yet it does not offer tools to let any webmaster extract or easily download indexed pages. Compare this to Yahoo, with its Yahoo Site Explorer that offers tools for downloading indexed pages. However, note that this data is only good in Yahoo, and is NOT applicable for Google.

These tutorials apply only to MS Excel 2002- 2007; note that Google can only give up to 1000 URLs.

When you visit http://www.google.com to find the indexed pages of your website, it is not as easy as typing in your domain name and let Google provide you a complete list of indexed pages. You have to use proper syntax in order for Google to give the complete list of URLs.

The indexed pages of your domain can be determined by entering this syntax in the Google search box and then pressing “Google Search” site:domainname.com

For example, if your website is http://www.thisisyourwebsite.com , you have to enter it as: site:thisisyourwebsite.com (not including the www).

We do not include the www in the search query so we can capture all URLs, including the other sub-domains in your website (not only those that start with “www”).

Below is the screen shot of the Google search result (circled in red are the indexed URLs which you will need to extract; the other results are irrelevant to your site):

Also note that the Google search result by default shows only the first ten results. Extracting the indexed URLs of your site from the Google search result to your Excel spreadsheet involves the copy and paste method. So we need to change this to the maximum (which is 100), which will speed up the process. Follow this simple procedure:

Step 1: Enter site:thisisyourwebsite.com in the Google search box.

Step 2: Beside the Google search box, click “preferences.”

Step 3: In the section for number of results, change it from 10 to 100.

Step 4: Click “Save Preferences.”

This method does not involve Visual Basic programming in Excel, but plain manipulation using the built-in text functions.

Step 1. After entering the search query: site:thisisyourwebsite.com in the Google search box, select the appropriate portion of data in the Google search result.

Only select the indexed pages and nothing else; see the sample screen shot below (highlighted regions are the selected areas to be copied and pasted into the Excel spreadsheet).

Do not include Ads and Sponsored results in the text selection.

Step 2. Right click and copy, open MS Excel, and then on cell A1: Right click again, Paste special, and paste as “text.”

After pasting the data, it should look like this:

After pasting the data as text, we will now filter the column for relevant information:

Step 3: Click on cell A1, then on the Excel buttons, click “Data,” Filter, and finally Auto filter.

Step 4: On the auto filter, click the drop down arrow button and then click “Custom.” The Custom Auto Filter Dialog Box will then appear.

Step 5: On the first drop down menu under “Shows rows where #:” select “Contains,” and then on the second drop down menu type www.aspfree.com. This means you will need to filter only information containing www.aspfree.com – only the indexed URLs of the domain.

Step 6: On the Custom Auto filter Dialog Box, find the checkbox containing “AND” and also “OR.” Check the “AND” checkbox.

Step 7. Below the AND/OR checkboxes, there are two additional drop down menus. On the left (first) drop down, select “contains” and then on the right (second) drop down, type “cached.” Finally, when everything is set, click OK.

In case you found the above steps confusing, make sure the Custom Auto filter Dialog Box looks like this after following Steps 1 to 7:

If you stumbled into this article in the hope of learning what “goal seek” and “Solver” are in relation to MS Excel applications, then you’ve come to the right place. First we will define their use. Goal seek is used when you know the result (it could be an answer, or any numerical value) but you are not sure what combinations will arrive at it. It lets you analyze only one variable at a time. Like goal seek, you use Solver when you know the answer, but it lets you analyze several factors at once to arrive at your desired result.For an example using goal seek, say you have a budget of $1000 per month on a certain project. And your budget is being governed by a specific formula, say Budget=5x + y (I am just making up this example to relate it to reality). Say x is the food budget and y is the expenses for transportation.

Thus you might want to compute the expenses for transportation, given the total food expenses to be $100. With problems like this, you need goal seek for faster computation and to avoid algebraic manipulation of the variables, which can be very time consuming to do on an Excel worksheet.

Now suppose you do not know the food and transportation expenses, but you are given the following constraints: food expenses should not exceed $400, yet not be less than $100, while transportation expenses should not exceed $600 yet not be less than $200. Given these criteria, you are asked to solve for the optimal food and travel expenses, given that the overall budget does not exceed $1000.

This type of problem can be solved by the MS Excel Solver add-on. Note that Solver is an add-on, and should be installed to be used, unlike goal seek which is a built-in function of Excel.

If you do not have Solver installed under the Tools section of your MS Excel, try reading this Microsoft tutorial on Installing and Running Solver.

I ill give you real examples of how to apply goal seek in MS Excel. This tutorial is tested to work starting with MS Excel 2002. MS Excel might be evolving; I have seen many advanced features, particularly in Excel 2007. But this tutorial is simple, and the basic principles are still covered.

source :: aspfree

Posted in Internet, Programing, Scripts, Solution | 1 Comment »

One Response

  1. Extracting Google-Indexed Web Site Pages Using MS Excel | pc-aras | learnexcelfast.info Says:

    [...] See the example post: Extracting Google-Indexed Web Site Pages Using MS Excel | pc-aras [...]

Support us

Your Donation Will Be Used To Pay For Hosting Service. Help Keep Our Web Site Online With a Donation!

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.