01-22-2012, 08:14 PM   #1
runewake2
iScrape (Text Extractor)


Download: Latest Version from Skydrive

iScrape is a simple program that reads web pages and grabs the text from them. It then saves the text as plain text files and opens them in Notepad. Simple.

This simplifies the process of grabbing a large number of ebooks from sites like FanFiction.

It ignores most other data such as ads and images. For this reason the scraper will not work well on sites that rely heavily on scripts or non-text content (images, forms etc).
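For anyone curious how this kind of extraction works, here is a minimal sketch in Python using only the standard library. To be clear, this is not iScrape's actual code (iScrape is C# and uses HTMLAgilityPack). The names `TextExtractor` and `extract_text` are made up for illustration, and skipping `script`, `style` and `noscript` is my assumption about what "ignoring other data" means.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the page title, skipping script/style content."""

    # Assumption: these are the elements whose contents should never be saved.
    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.chunks = []
        self._skip_depth = 0    # greater than 0 while inside a skipped element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return              # inside <script>, <style> or <noscript>
        if self._in_title:
            self.title += data  # the title is kept separately from the body text
            return
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html):
    """Return (title, text) for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), "\n".join(parser.chunks)
```

Images and forms produce no text nodes, so they drop out on their own. A page that builds its content with scripts has nothing left once the scripts are skipped, which is why script-heavy sites come out nearly empty.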

iScrape accepts commands either as command line arguments or typed in after the program has started. If that makes no sense to you, just ignore it.

When you launch iScrape it will prompt you to enter an address. This may be either the URL of a web page that you wish to grab the text from or the location of a plain text (.txt) file that contains a list (separated by spaces) of the URLs you want to read. If you don't understand the second option, a sample "test.txt" file has been included that downloads 4 chapters of a sample book from FanFiction.
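The two kinds of input described above can be sketched roughly like this in Python. This is an illustration, not the real iScrape logic: the function names are hypothetical, and deciding "list file or URL" by checking for a ".txt" ending on something that is not an http(s) address is my assumption.

```python
def read_url_list(path):
    """Read a .txt file of URLs separated by spaces.

    split() with no argument also accepts tabs and newlines,
    which is slightly more lenient than "separated by spaces".
    """
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()


def resolve_targets(address):
    """Turn one address into the list of URLs to fetch."""
    address = address.strip()
    lowered = address.lower()
    is_web_address = lowered.startswith(("http://", "https://"))
    if lowered.endswith(".txt") and not is_web_address:
        return read_url_list(address)   # a local list file
    return [address]                    # a single page


def get_address(argv):
    """Use the first command line argument if given, otherwise prompt."""
    if len(argv) > 1:
        return argv[1]
    return input("Address (URL or .txt list): ")
```

A caller would do something like `resolve_targets(get_address(sys.argv))` and then loop over the result, fetching one page per URL.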

If you enter an invalid URL or txt file, the program will crash and exit. If that happens, all you need to do is fix whatever was wrong in your txt file and restart.

If you run into problems, just let me know. I plan on expanding this far beyond its current scope.

This program uses the HTMLAgilityPack library, which is not developed by me.

Theoretically this should make getting all of those text files for your Zune eReaders a lot easier.


NOTE: Files are saved as "[TITLE].txt" where [TITLE] is the title of the web page you downloaded them from. For those of you who don't know, the title of a page is what appears in your browser tab. Spaces in the title are kept in the file name.
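As a rough illustration of the naming rule, here is a small Python sketch. The function name is hypothetical, and both the removal of characters Windows rejects in file names and the "untitled" fallback are my assumptions; the post only says the title is used and spaces are kept.

```python
import re

# Characters Windows does not allow in file names.
INVALID_CHARS = r'[<>:"/\\|?*]'


def title_to_filename(title):
    """Build "[TITLE].txt" from a page title, keeping its spaces."""
    cleaned = re.sub(INVALID_CHARS, "", title).strip()
    # Assumption: fall back to a fixed name when nothing usable is left.
    return (cleaned or "untitled") + ".txt"
```

For example, `title_to_filename("Dice - Wikipedia")` gives `Dice - Wikipedia.txt`.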

NOTE 2: This is written in C# and so should only work on Windows machines.


I have tested this on FanFiction. Further tests will be done shortly but I figured I'd get this out there. This may have errors when reading from other sites.


An example: the Wikipedia article on "Dice"

If you run into problems or have questions feel free to ask me.

Last edited by runewake2; 01-22-2012 at 08:54 PM. Reason: added an example





01-23-2012, 04:41 PM   #2
Prodigy

u r truly amazing. thank you so much





01-23-2012, 04:53 PM   #3
ZuFi

Well, now I don't have to learn how to transfer eBooks or whatever. Great job.

01-23-2012, 08:51 PM   #4
runewake2

I've done some more testing and it doesn't appear to work that well on forums etc. For example, if you try to scan this page you will only get the text at the very bottom, "Powered by V..."

If you do have a site you really need to extract the text from, I should be able to get something working. Or I can release the source. The program itself is extremely simple.

Also, if anyone can find a link explaining how Visual Studio project files are composed, I should be able to have it automatically import the text files into the reader that most people probably use. Starts with a K... kirarulzz. Anyway, I'm planning on getting it to add them to your project (perhaps sorted by site) and have it ready to deploy. That is where this is going. Hope it actually gets that far.


Looks nice. All I need to do is add some lines like these:

Quote:
<ItemGroup>
  <None Include="books\Michael Cricton\Jurassic Park.txt">
    <Name>Jurassic Park</Name>
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </None>
</ItemGroup>
And everything should be good.
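Generating entries like the one quoted above could be sketched as follows in Python with the standard `xml.etree.ElementTree` module. This only builds the fragment; the function name is hypothetical, taking the display name from the file name is my assumption, and merging the result into a real project file (which may declare an XML namespace) is not handled here.

```python
import xml.etree.ElementTree as ET


def book_item_group(paths):
    """Build an <ItemGroup> with one <None> entry per book file."""
    group = ET.Element("ItemGroup")
    for path in paths:
        item = ET.SubElement(group, "None", Include=path)
        # Assumption: the display name is the file name without folder or ".txt".
        name = path.replace("/", "\\").rsplit("\\", 1)[-1]
        if name.lower().endswith(".txt"):
            name = name[:-4]
        ET.SubElement(item, "Name").text = name
        ET.SubElement(item, "CopyToOutputDirectory").text = "PreserveNewest"
    return group
```

Calling `ET.tostring(book_item_group([...]), encoding="unicode")` returns the fragment as a string, ready to be pasted or written into the project.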

Last edited by runewake2; 01-23-2012 at 09:10 PM.



