Login page [url1]:
https://investorservice.cfmmc.com/
CAPTCHA image [url2]:
https://investorservice.cfmmc.com/veriCode.do
Login submission [url3]:
https://investorservice.cfmmc.com/login.do
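The site uses a CAPTCHA that is hard to recognize programmatically: the last character is too close to the background color to be read reliably, so a human has to type it in.
Characteristics of this site:
1. It uses HTTPS.
2. The session ID is stored in cookies.
3. Requesting the CAPTCHA page does not set any cookies; before login, the only way to obtain cookies is to request url1.
Approach:
Step 1:
Send a GET request to url1 and keep the cookies it returns.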
// Requires: using System.Net;
CookieContainer cookies = new CookieContainer(); // shared by every later request
string url = "https://investorservice.cfmmc.com/";
HttpWebRequest myHttpWebRequest = (HttpWebRequest)WebRequest.Create(url);
myHttpWebRequest.Timeout = 20 * 1000; // request timeout in milliseconds
myHttpWebRequest.Accept = "*/*";
myHttpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0;)";
myHttpWebRequest.CookieContainer = cookies; // cookies set by the response are stored here
myHttpWebRequest.GetResponse().Close(); // only the cookies are needed, not the body
string cookiesstr = cookies.GetCookieHeader(myHttpWebRequest.RequestUri); // cookies as a "name=value; ..." string
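Step 1 keeps the cookies both as a CookieContainer and as a string, and step 3 turns the string back into a container. The following is a minimal offline sketch of that round trip, not the site's actual cookies: the cookie name JSESSIONID, its value, and the helper FromHeader are made up for illustration.

```csharp
using System;
using System.Net;

static class CookieHeaderDemo
{
    // Rebuilds a CookieContainer from a "name=value; name2=value2" header string.
    public static CookieContainer FromHeader(Uri uri, string header)
    {
        var container = new CookieContainer();
        foreach (string part in header.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split on the first '=' only, so values that contain '=' stay intact.
            string[] pair = part.Split(new[] { '=' }, 2);
            if (pair.Length != 2) continue; // skip malformed entries
            container.Add(uri, new Cookie(pair[0].Trim(), pair[1].Trim()));
        }
        return container;
    }

    static void Main()
    {
        var uri = new Uri("https://investorservice.cfmmc.com/");
        var original = new CookieContainer();
        original.Add(uri, new Cookie("JSESSIONID", "abc123")); // hypothetical name and value

        string header = original.GetCookieHeader(uri);     // container -> string
        CookieContainer rebuilt = FromHeader(uri, header); // string -> container

        Console.WriteLine(header);
        Console.WriteLine(rebuilt.GetCookieHeader(uri));
    }
}
```

If every request runs in the same process, passing the same CookieContainer instance to each request avoids the string conversion entirely; the string form is mainly useful when the cookies have to be stored between requests.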
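Step 2:
Send a GET request to url2 and save the CAPTCHA image locally. This request must carry the cookies obtained in step 1; otherwise the server cannot tie the CAPTCHA to the session that later logs in, and CAPTCHA validation fails. The code: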
// Requires: using System.IO; using System.Net;
url = "https://investorservice.cfmmc.com/veriCode.do";
myHttpWebRequest = (HttpWebRequest)WebRequest.Create(url);
myHttpWebRequest.Timeout = 20 * 1000; // request timeout in milliseconds
myHttpWebRequest.Accept = "*/*";
myHttpWebRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0;)";
myHttpWebRequest.Method = "GET";
myHttpWebRequest.CookieContainer = cookies; // reuse the cookies from step 1
string path = System.Web.HttpContext.Current.Server.MapPath("\\tmp\\vericode.jpg");
using (HttpWebResponse myHttpWebResponse = (HttpWebResponse)myHttpWebRequest.GetResponse())
using (Stream stream = myHttpWebResponse.GetResponseStream())
// FileMode.Create truncates any earlier image, so no stale bytes are left behind
using (FileStream writer = new FileStream(path, FileMode.Create, FileAccess.Write))
{
    byte[] buff = new byte[512];
    int c; // number of bytes actually read
    while ((c = stream.Read(buff, 0, buff.Length)) > 0)
    {
        writer.Write(buff, 0, c);
    }
}
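The read/write loop in step 2 can also be expressed with Stream.CopyTo (available since .NET Framework 4). The sketch below is one possible variant, run against an in-memory stream instead of the real CAPTCHA response; the helper name SaveToFile, the four placeholder bytes, and the temp-file name are assumptions made for the demo.

```csharp
using System;
using System.IO;

static class SaveStreamDemo
{
    // Writes the whole source stream to path, replacing any existing file.
    public static void SaveToFile(Stream source, string path)
    {
        // FileMode.Create truncates an existing file before writing.
        using (var file = new FileStream(path, FileMode.Create, FileAccess.Write))
        {
            source.CopyTo(file);
        }
    }

    static void Main()
    {
        // Placeholder bytes standing in for the downloaded CAPTCHA image.
        byte[] fakeImage = { 0xFF, 0xD8, 0xFF, 0xD9 };
        string path = Path.Combine(Path.GetTempPath(), "vericode-demo.jpg");

        using (var source = new MemoryStream(fakeImage))
        {
            SaveToFile(source, path);
        }
        Console.WriteLine(new FileInfo(path).Length);
    }
}
```

In the real flow, the source argument would be the stream returned by GetResponseStream() on the url2 response.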
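Step 3:
Once a person has read the downloaded CAPTCHA and typed it in, POST the login form to url3.
The request must carry: user name, password, CAPTCHA text, and the cookies.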
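The login form is sent as application/x-www-form-urlencoded, so the fields have to be URL-encoded and joined with "&". A minimal sketch of building such a body follows. The field names username, password, and vericode are placeholders, since the post does not list the real ones; they would need to be read from the login page's form.

```csharp
using System;
using System.Collections.Generic;

static class FormBodyDemo
{
    // Joins key/value pairs into a URL-encoded form body: k1=v1&k2=v2
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var parts = new List<string>();
        foreach (KeyValuePair<string, string> field in fields)
        {
            // EscapeDataString percent-encodes reserved characters (a space becomes %20).
            parts.Add(Uri.EscapeDataString(field.Key) + "=" + Uri.EscapeDataString(field.Value));
        }
        return string.Join("&", parts);
    }

    static void Main()
    {
        // Hypothetical field names and values; check the real form before use.
        var fields = new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("username", "demo user"),
            new KeyValuePair<string, string>("password", "p&ss=word"),
            new KeyValuePair<string, string>("vericode", "a1b2"),
        };
        Console.WriteLine(BuildFormBody(fields));
    }
}
```

The resulting string plays the role of the `data` variable in the POST code.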
Note that this is an HTTPS request. The code:
// Requires: using System; using System.IO; using System.Net; using System.Net.Security;
// Inputs: data is the URL-encoded form body (user name, password, CAPTCHA text);
// cookiesstr is the cookie string produced in step 1.
url = "https://investorservice.cfmmc.com/login.do";
System.GC.Collect(); // force a collection to reclaim HTTP connections that were not closed properly
string result = ""; // response body
int timeout = 30; // seconds
string charset = "utf-8";
HttpWebRequest request = null;
HttpWebResponse response = null;
try
{
    // Raise the maximum number of concurrent connections
    ServicePointManager.DefaultConnectionLimit = 200;
    // Install the certificate validation callback for HTTPS.
    // CheckValidationResult is defined elsewhere and is not shown in this post.
    if (url.StartsWith("https", StringComparison.OrdinalIgnoreCase))
    {
        ServicePointManager.ServerCertificateValidationCallback =
            new RemoteCertificateValidationCallback(CheckValidationResult);
    }
    // Configure the HttpWebRequest
    request = (HttpWebRequest)WebRequest.Create(url);
    request.Method = "POST";
    request.Timeout = timeout * 1000;
    //// Optional proxy server
    //WebProxy proxy = new WebProxy();
    //proxy.Address = new Uri(WxPayConfig.PROXY_URL); // proxy address as host:port
    //request.Proxy = proxy;
    // Content type and length of the POST body
    request.ContentType = string.Format("application/x-www-form-urlencoded;charset={0}", charset);
    byte[] res = System.Text.Encoding.GetEncoding(charset).GetBytes(data);
    request.ContentLength = res.Length;
    // Rebuild the cookie container from the step 1 cookie string
    CookieContainer cc = new CookieContainer();
    string[] arr_cookies = cookiesstr.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < arr_cookies.Length; i++)
    {
        // Split on the first '=' only, so values containing '=' are preserved
        string[] arr_item = arr_cookies[i].Split(new[] { '=' }, 2);
        if (arr_item.Length != 2) continue; // skip malformed entries
        cc.Add(new Uri(url), new Cookie(arr_item[0].Trim(), arr_item[1].Trim()));
    }
    request.CookieContainer = cc;
    // Write the body to the server
    using (Stream reqStream = request.GetRequestStream())
    {
        reqStream.Write(res, 0, res.Length);
    }
    // Read the server's response
    response = (HttpWebResponse)request.GetResponse();
    using (StreamReader sr = new StreamReader(response.GetResponseStream(), System.Text.Encoding.GetEncoding(charset)))
    {
        result = sr.ReadToEnd().Trim();
    }
}
catch (Exception e)
{
    // Record the failure instead of swallowing it; result stays empty
    System.Diagnostics.Trace.TraceError("Login request failed: " + e);
}
finally
{
    // Release the connection
    if (response != null)
    {
        response.Close();
    }
    if (request != null)
    {
        request.Abort();
    }
}
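The POST code registers a certificate callback named CheckValidationResult, but the post does not show its body. A possible shape is sketched below, assuming the goal is to accept only certificates that pass the default checks; this may differ from the original, since returning true unconditionally is a common shortcut that disables certificate validation altogether.

```csharp
using System;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

static class CertificateCheckDemo
{
    // Matches the RemoteCertificateValidationCallback delegate signature.
    public static bool CheckValidationResult(
        object sender,
        X509Certificate certificate,
        X509Chain chain,
        SslPolicyErrors errors)
    {
        // Accept the certificate only if the platform found no problems with it.
        return errors == SslPolicyErrors.None;
    }

    static void Main()
    {
        // Register the callback the same way the POST code does.
        ServicePointManager.ServerCertificateValidationCallback =
            new RemoteCertificateValidationCallback(CheckValidationResult);

        // Call the callback directly to show both outcomes.
        Console.WriteLine(CheckValidationResult(null, null, null, SslPolicyErrors.None));
        Console.WriteLine(CheckValidationResult(null, null, null, SslPolicyErrors.RemoteCertificateNameMismatch));
    }
}
```

The callback is process-wide, so it affects every HTTPS request made through HttpWebRequest in the same application.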
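After this, result holds the post-login page returned by the CSRC margin monitoring site. To scrape data, just process this returned content.
That's all.
Author: 老徐
Link: https://bigger.ee/archives/4.html
When reposting, please credit the source and include this notice.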