Fixing garbled text when a web crawler fetches page content

Date: 2009-12-22 15:42  Source: unknown  Author: admin


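Problem:

When the crawler automatically collects page content, some pages come out garbled.

Cause:

The page was read with the wrong encoding. The encoding that C#/.NET reports through its built-in classes is sometimes wrong; I believe that for sites that are not ASP.NET applications, the encoding it reports is always wrong.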
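To make the cause concrete, here is a minimal sketch (not part of the original post) that decodes the same bytes once with a wrong encoding and once with the right one. UTF-8 and ISO-8859-1 are chosen only because every .NET runtime ships them; the class and method names are illustrative assumptions.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
static class MojibakeDemo
{
    // Decode UTF-8 bytes as ISO-8859-1 on purpose: every byte becomes
    // one character, so multi-byte characters turn into garbage.
    public static string DecodeWrong(byte[] bytes)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(bytes);
    }

    // Decode the same bytes with the encoding they were written in.
    public static string DecodeRight(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Calling both methods on `Encoding.UTF8.GetBytes("中文")` returns a garbled six-character string from `DecodeWrong` and the original text from `DecodeRight`. The bytes never change; only the encoding used to interpret them does.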

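Solution:

Idea: determine the page's encoding at run time first, and only then read the page content with it. Content read this way will not be garbled.

Method:

1. Read the page content using ASCII encoding.

2. Use a regular expression to pick the page's declared encoding out of that content. The text from step 1 may be garbled, but the HTML tags are intact, so the encoding can be read from the HTML markup.

3. Read the page again with the correct encoding.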
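The three steps above can also be sketched against a byte array that has already been downloaded, which avoids requesting the page twice. This is an illustrative variation under stated assumptions, not the author's code: `CharsetSniffer`, `SniffCharset`, and `Decode` are hypothetical names, and the fallback encoding is whatever the caller passes in.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
static class CharsetSniffer
{
    // Matches charset=gb2312 as well as charset="utf-8".
    static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?(?<charset>[\w-]+)", RegexOptions.IgnoreCase);

    // Steps 1 and 2: decode as ASCII (tags survive even if the text is
    // garbled), then look for the declared charset.
    public static string SniffCharset(byte[] body)
    {
        string ascii = Encoding.ASCII.GetString(body);
        Match m = CharsetPattern.Match(ascii);
        return m.Success ? m.Groups["charset"].Value : null;
    }

    // Step 3: decode the same bytes again with the declared encoding.
    // Note: on .NET Core / .NET 5+, names such as gb2312 are only known after
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) is called.
    public static string Decode(byte[] body, string fallback)
    {
        string name = SniffCharset(body) ?? fallback;
        try
        {
            return Encoding.GetEncoding(name).GetString(body);
        }
        catch (ArgumentException)
        {
            // Unknown charset name: fall back to the caller's choice.
            return Encoding.GetEncoding(fallback).GetString(body);
        }
    }
}
```

Working on the raw bytes means the network is hit once, and the ASCII pass is only used to find the charset declaration, never for the final text.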

If anyone knows a better way, please share it!

The code follows:

Code example

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.Web;
using System.IO;
using System.Text.RegularExpressions;

namespace charset
{
    class Program
    {
        static void Main(string[] args)
        {
            string url = "http://www.gdqy.edu.cn";
            GetCharset1(url);
            GetChartset2(url);
            Console.Read();
        }

        // Read the page encoding directly from HttpWebResponse.
        static void GetCharset1(string url)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                // Dispose the response so the connection is released.
                using (HttpWebResponse webResponse = (HttpWebResponse)webRequest.GetResponse())
                {
                    string charset = webResponse.CharacterSet;
                    string contentEncoding = webResponse.ContentEncoding;
                    string contentType = webResponse.ContentType;
                    Console.WriteLine("context type:{0}", contentType);
                    Console.WriteLine("charset:{0}", charset);
                    Console.WriteLine("content encoding:{0}", contentEncoding);
                    // Uncomment to check whether the fetched page is garbled.
                    //Console.WriteLine(getHTML(url,charset));
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

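        // Get the page encoding with a regular expression.
        static void GetChartset2(string url)
        {
            try
            {
                // Pass WebName (the IANA name); EncodingName is a display name
                // that Encoding.GetEncoding may not accept.
                string html = getHTML(url, Encoding.ASCII.WebName);
                // Matches charset=gb2312 as well as charset="gb2312".
                Regex reg_charset = new Regex(@"charset\s*=\s*[""']?(?<charset>[^""']*)", RegexOptions.IgnoreCase);
                string encoding = null;
                // getHTML returns null when the request fails.
                if (html != null && reg_charset.IsMatch(html))
                {
                    encoding = reg_charset.Match(html).Groups["charset"].Value;
                    Console.WriteLine("charset:{0}", encoding);
                }
                else
                {
                    encoding = Encoding.Default.WebName;
                }
                // Uncomment to check whether the fetched page is garbled.
                //Console.WriteLine(getHTML(url,encoding));
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }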

        // Read the page content using the named encoding.
        static string getHTML(string url, string encodingName)
        {
            try
            {
                WebRequest webRequest = WebRequest.Create(url);
                using (WebResponse webResponse = webRequest.GetResponse())
                using (Stream stream = webResponse.GetResponseStream())
                using (StreamReader sr = new StreamReader(stream, Encoding.GetEncoding(encodingName)))
                {
                    return sr.ReadToEnd();
                }
            }
            catch (UriFormatException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
            catch (WebException ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
        }
    }
}

The page at http://www.gdqy.edu.cn is encoded as gb2312.

The first method prints:

    context type:text/html
    charset:ISO-8859-1
    content encoding:

The second method prints:

    charset:gb2312

So the first method reports the wrong encoding and the second reports the right one.

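Why does the first method report ISO-8859-1?

I used Reflector to decompile the CharacterSet property, and the reason is easy to see from its source. The Content-Type header here is just text/html with no charset parameter, and for any text/* content type the property starts from the default ISO-8859-1 and only replaces it when the header contains charset=. The source of the ContentType property would make the cause clearer still, but I could not find it after a long search; if anyone can supply it, I would be very grateful.

The decompiled source of CharacterSet is attached below for anyone interested.

CharacterSet source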

public string CharacterSet
{
    get
    {
        this.CheckDisposed();
        string text1 = this.m_HttpResponseHeaders.ContentType;
        if ((this.m_CharacterSet == null) && !ValidationHelper.IsBlankString(text1))
        {
            this.m_CharacterSet = string.Empty;
            string text2 = text1.ToLower(CultureInfo.InvariantCulture);
            // Default for every text/* content type.
            if (text2.Trim().StartsWith("text/"))
            {
                this.m_CharacterSet = "ISO-8859-1";
            }
            int num1 = text2.IndexOf(";");
            if (num1 > 0)
            {
                // Look for a charset= parameter after the media type.
                while ((num1 = text2.IndexOf("charset", num1)) >= 0)
                {
                    num1 += 7;
                    if ((text2[num1 - 8] == ';') || (text2[num1 - 8] == ' '))
                    {
                        while ((num1 < text2.Length) && (text2[num1] == ' '))
                        {
                            num1++;
                        }
                        if ((num1 < (text2.Length - 1)) && (text2[num1] == '='))
                        {
                            num1++;
                            int num2 = text2.IndexOf(';', num1);
                            if (num2 > num1)
                            {
                                this.m_CharacterSet = text1.Substring(num1, num2).Trim();
                                break;
                            }
                            this.m_CharacterSet = text1.Substring(num1).Trim();
                            break;
                        }
                    }
                }
            }
        }
        return this.m_CharacterSet;
    }
}
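
Because CharacterSet cannot tell "the server declared ISO-8859-1" apart from "the server declared nothing", a crawler may want to check the raw Content-Type value itself. The sketch below is one possible way to do that with the System.Net.Mime.ContentType class; `HeaderCharset` and `Parse` are hypothetical names introduced for illustration, not part of the original post.

```csharp
using System;
using System.Net.Mime;

// Hypothetical helper, for illustration only.
static class HeaderCharset
{
    // Returns the charset named in a Content-Type header value,
    // or null when the header carries no charset parameter.
    public static string Parse(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
        {
            return null;
        }
        try
        {
            return new ContentType(contentTypeHeader).CharSet;
        }
        catch (FormatException)
        {
            // Malformed header: treat it as "not declared".
            return null;
        }
    }
}
```

With the output shown earlier, `HeaderCharset.Parse(webResponse.ContentType)` would return null for text/html, which signals that the encoding should be taken from the HTML markup instead of from the header.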

http://www.cnblogs.com/xuanfeng/archive/2007/01/21/626296.html
