AC自动机之C#网页爬虫1.0(第一天总结)
目前已经有了一个大题思路了,由于晚上的时候兴起想写这么一个程序,然后百度之得:可以实现,遂代码搞起
基本思路:
1、抓取百度的搜索结果,解析网页链接,得到需要跳转的页面
2、然后进一步解析博客等结果页面,抓取代码,保存代码为文件
3、这样就得到了代码,也就实现了AC自动机的第一步,在之后程序中拿这些代码然后自动提交OJ
ok,相关文档后面陆续添加之
PS:第一天写的代码把百度的搜索结果也保存了,第二天的时候把这个保存操作注释掉了,那东西感觉保存下来没什么用,还占地方,最后拿到结果链接就行了。然后添加读入题目范围的操作,其他也都基本没变,代码已更新。
last modified by jtahstu on 2015/8/11 23:13
第一天写的一个简短的代码,可以运行了
using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Threading.Tasks; using System.IO; using System.Net; using System.Text.RegularExpressions; namespace 爬虫 { class Program { static void Main(string[] args) { int st =0;//起始搜索题目 int en = 0;//终止搜索题目 Console.Write("请输入起始搜索题号:"); st = int.Parse(Console.ReadLine()); Console.Write("请输入终止搜索题号:"); en = int.Parse(Console.ReadLine()); for (int k = st; k <= en; k++) { Console.WriteLine("目前正在处理hdu第{0}题,Please wait.....",k); System.Net.WebClient client = new WebClient(); string problem = "hdu%20" + k.ToString(); string hproblem = "http://www.baidu.com/s?word=" + problem + "&pn=0"; Stream strm = client.OpenRead(hproblem); StreamReader sr = new StreamReader(strm); //不保存百度后的链接,保存那个东西没用 //string file = "百度搜索hdu" + k.ToString() + "题代码.html"; //StreamWriter sw = new StreamWriter(@"G:\ACM\OJ\result\"+file); string link = "hdu" + k.ToString() + "题处理后链接.txt"; StreamWriter sw2 = new StreamWriter(@"G:\ACM\OJ\link\" + link); string s = sr.ReadToEnd(); //sw.WriteLine(s); int index = 0; for (int i = 0; i < 18; i++) { string temp = "http://www.baidu.com/link?url"; int start = s.IndexOf(temp, index); int end = s.IndexOf("\"", start + 5); string ans = s.Substring(start, end - start); if (i % 2 == 0) { sw2.WriteLine(ans); } index = end + 5; } sr.Close(); //sw.Close(); sw2.Close(); Console.WriteLine("目前hdu第{0}题已经处理完毕,Please wait.....\n",k); } //http://www.baidu.com/link?url=ZdN5LcBLNMw4xK45gGwVEq-vyuuKQbXU7CJDF8YQixQbE1VveqWkUhGWUklD91Mcri3Q0gfwPk9fkjIIDnPWXBsfx54Wc9ynaaN0pOteUtS //System.Net.WebClient client2 = new WebClient(); //Stream page = client2.OpenRead("http://www.baidu.com/link?url=ZdN5LcBLNMw4xK45gGwVEq-vyuuKQbXU7CJDF8YQixQbE1VveqWkUhGWUklD91Mcri3Q0gfwPk9fkjIIDnPWXBsfx54Wc9ynaaN0pOteUtS"); //StreamReader pageer = new StreamReader(page); //string spage = pageer.ReadToEnd(); //StreamWriter wpage = new StreamWriter(@"G:\ACM\OJ\linkpage\result.html"); //wpage.WriteLine(spage); //page.Close(); //pageer.Close(); //wpage.Close(); } } }
一道题运行时会保存两个文件:1、百度搜索结果页面,2、解析后的链接代码
解析后代码有9个,如下这种形式:
http://www.baidu.com/link?url=vYvYvqzA2XFNoJJudlGF27aAGDANeFJoq4PqrL_iWQUQE-s8lzUsnoU9ynhc8UMVs_puIO98WuuExJJ2sfDXD_
http://www.baidu.com/link?url=52GJT0mmNYcljJN6QKFozmhmoex4F-mAiGr64VwBUTcMv0ymiC4NnkPDk5EzdtFcoMZGZSsJc3BQz6UI6-rT-zLEK6uwEBpwrZpofetlo_u
http://www.baidu.com/link?url=ALsNS2WuqBpGCvbRttZEsoWZwXSPIMaltkNs8bPa4QYKnG6Nlp4LtdCbjJXVfbQAD8bc7e_-5_aRGXIyC0jg6K
http://www.baidu.com/link?url=bqaClQcmmhTLWNOVogmJxdbDg3kKkMPotjpK5no2fqqRfejMqvue58ly1rL4LsxebNyj5pRr8Ph5VxtsL7EVmbbQEG8E1RA8svl1AW-5vH7
http://www.baidu.com/link?url=7wD71YKh9ipX3xiQ7tYJ-Pi9MIdeRLSKv9kPmMVgh27bfFUctDDzR-yN5e0sJJNc3OIShEZPI44L8Nlq98SF_K
http://www.baidu.com/link?url=d2MoBA5YWdAhgrrskwQOOg1GH-oCj64Y25Y-btXuGyyFBaW5r2eOz_OCbay-9vjBbZiub1XZN2f2HGgtDoo_QFXhp_bF9kWVCT2buJGz5Sa
http://www.baidu.com/link?url=Z7WkdQE0_9xpppAOm7R0WANfxX8uVSYLlpullLOs_EkQzi8tl3-cAngu1KegJwlDk0Qgw8Y2HVFgqwIwiweNln-OulphvnLPwhh5X8paPsy
http://www.baidu.com/link?url=3PW9dgcpPEX7F8uAFfMVFqrDs9KQqMUwe2c46iebLpt3B7_I7hIRUpYvIytKy6XqwtMEIdniFLLxg7ylSoU1xq
http://www.baidu.com/link?url=ROUmE9aYwyqYz3TXQuhIXExX_NJq5ccLf_DTTotT2o8wwo5Kv9kU7Ix1OwBsilh2idcHnrbaR6vczABvHDBCp_
运行时的照片:
ok,第一天就是这个样子,明天继续奋斗,继续更新...