Real-Time Scraping of WeChat Official Accounts

DurkBlue, 2019-12-27


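Many businesses need to scrape WeChat official accounts.

Sometimes, though, the target app's protections or the limits of our own skills mean that simply unpacking the app is not an option.

So today we take a different approach to scraping official accounts.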

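Scraping approach

Use Appium to automate the phone and simulate a user working through the WeChat official account list.

Use mitmproxy as a man-in-the-middle proxy to intercept the traffic and parse out each account's article list page.

Use Python to scrape the content of the official account articles.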
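The excerpts in this article cover the first two steps; the third is only described. As a rough sketch of what it might look like, assume the article URLs arrive on a Redis list named `wechat` (the key the proxy code in this article pushes to). The names `fetch_html` and `drain_queue` are hypothetical and not taken from the project; the queue is passed in as a plain callable so the logic can be tried without a Redis server.

```python
import urllib.request
from typing import Callable, Dict, Optional, Union


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one article page with a plain HTTP GET (no retries)."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def drain_queue(
    pop: Callable[[], Optional[Union[bytes, str]]],
    fetch: Callable[[str], str] = fetch_html,
) -> Dict[str, str]:
    """Pop URLs until the queue reports empty (None) and return {url: html}."""
    pages: Dict[str, str] = {}
    while True:
        raw = pop()
        if raw is None:
            break
        # redis-py returns bytes unless decode_responses=True is set.
        url = raw.decode() if isinstance(raw, bytes) else raw
        pages[url] = fetch(url)
    return pages


# With redis-py this could be wired up as follows (connection settings assumed):
#   r = redis.Redis()
#   pages = drain_queue(lambda: r.rpop("wechat"))
```

A long-running worker would more likely block on the list and loop forever; the version above stops when the list is empty, which keeps it easy to test.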

Source code

WeChat official account scraper project repository

Key source code walkthrough

The Appium part. First we need to identify the Activity behind each screen and the buttons on each Activity.
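One common way to find the Activity behind the screen currently shown is to read the focused window from `adb shell dumpsys window`. The sketch below is not part of the original project: `parse_focus` and `current_focus` are hypothetical helpers, and the `mCurrentFocus=Window{...}` line format is an assumption that can differ between Android versions. Inside an Appium session, `driver.current_activity` gives similar information.

```python
import re
import subprocess
from typing import Optional, Tuple

# Matches e.g. mCurrentFocus=Window{a1b2c3 u0 com.tencent.mm/com.tencent.mm.ui.LauncherUI}
FOCUS_RE = re.compile(r"mCurrentFocus=Window\{\S+ \S+ ([\w.]+)/([\w.$]+)\}")


def parse_focus(dumpsys_output: str) -> Optional[Tuple[str, str]]:
    """Return (package, activity) from dumpsys output, or None if not found."""
    match = FOCUS_RE.search(dumpsys_output)
    return (match.group(1), match.group(2)) if match else None


def current_focus() -> Optional[Tuple[str, str]]:
    """Ask the connected device which window has focus (requires adb on PATH)."""
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "window"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_focus(out)
```

Navigating to a screen by hand and calling `current_focus()` yields the names to pass to `wait_activity`; a UI inspector is still needed to look up the buttons' resource ids.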

import time

from appium import webdriver
from appium.webdriver.common.touch_action import TouchAction
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Launch WeChat on the connected device; noReset keeps the logged-in session.
desired_caps = {
    "platformName": "Android",
    "deviceName": "a",
    "appPackage": "com.tencent.mm",
    "appActivity": "com.tencent.mm.ui.LauncherUI",
    "noReset": True,
}
url = 'http://localhost:4723/wd/hub'
driver = webdriver.Remote(url, desired_capabilities=desired_caps)

# Main screen -> "通讯录" (Contacts) tab -> "公众号" (Official Accounts) list.
driver.wait_activity('.ui.LauncherUI', timeout=10)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@text="通讯录"]'))
).click()
driver.find_element(By.XPATH, '//*[@text="公众号"]').click()
driver.wait_activity('.plugin.brandservice.ui.BrandServiceIndexUI', timeout=10)

while True:
    try:
        # Resource ids such as a2y, jy and b0r are tied to the WeChat build used here.
        items = driver.find_elements(By.XPATH, '//*[@resource-id="com.tencent.mm:id/a2y"]')
        for item in items:
            # Open the account's chat, then its profile page.
            item.click()
            driver.wait_activity('.ui.chatting.ChattingUI', timeout=10)
            driver.find_element(By.ID, 'com.tencent.mm:id/jy').click()
            driver.wait_activity('.plugin.profile.ui.ContactInfoUI', timeout=10)
            # driver.find_element(By.ID, 'com.tencent.mm:id/b0u').click()
            # Swipe up (coordinates are hard-coded for one screen size), then tap
            # the last b0r entry so the article list page is requested.
            TouchAction(driver).press(x=569, y=2000).move_to(x=390, y=792).release().perform()
            driver.find_elements(By.XPATH, '//*[@resource-id="com.tencent.mm:id/b0r"]')[-1].click()
            driver.wait_activity('.plugin.profile.ui.ContactInfoUI', timeout=10)
            driver.back()
    except Exception:
        # Best effort: ignore UI errors and try again on the next pass.
        pass
    time.sleep(1)
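The swipe in the script uses absolute pixel coordinates, so it only fits the screen it was recorded on. A minimal sketch of one way to make it resolution-independent is to derive the points from the window size. The helper name `swipe_points` and the default fractions are assumptions for illustration, not values from the project.

```python
from typing import Tuple

Point = Tuple[int, int]


def swipe_points(
    width: int,
    height: int,
    x_frac: float = 0.5,
    start_frac: float = 0.85,
    end_frac: float = 0.35,
) -> Tuple[Point, Point]:
    """Return (start, end) points for an upward swipe, as fractions of the screen."""
    x = round(width * x_frac)
    return (x, round(height * start_frac)), (x, round(height * end_frac))


# Inside an Appium session this could be used as:
#   size = driver.get_window_size()          # {'width': ..., 'height': ...}
#   (x1, y1), (x2, y2) = swipe_points(size['width'], size['height'])
#   driver.swipe(x1, y1, x2, y2, 500)        # duration in milliseconds
```

The fractions may need tuning per device, but the script no longer has to be edited for every screen size.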

mitmproxy man-in-the-middle code

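import json
import sys

sys.path.append('..')
sys.path.append('../..')
sys.path.append('../../..')

import redis

from wechat.settings import QUEUES

QUEUE_CONF = QUEUES['tasks']
r = redis.Redis(**QUEUE_CONF)


class WeChatProxyHandler:
    # History (article list) page of an official account.
    url = 'https://mp.weixin.qq.com/mp/profile_ext?action=home'

    def response(self, flow):
        if self.url not in flow.request.url:
            return
        for line in flow.response.text.split('\n'):
            line = line.strip()
            if 'var msgList' not in line:
                continue
            # The line looks like: var msgList = '{&quot;list&quot;:[...]}';
            # Strip the wrapper, restore the quotes, then parse the payload as JSON
            # so that untrusted page content is never evaluated as code.
            raw = line[len('var msgList = ') + 1:-2]
            msg_list = json.loads(raw.replace('&quot;', '"'))
            urls = [item.get('app_msg_ext_info', {}).get('content_url') for item in msg_list['list']]
            # Undo any remaining escaped slashes (\/) in the URLs.
            urls = [url.replace('\\/', '/') for url in urls if url]
            if urls:  # lpush fails when called without values
                r.lpush('wechat', *urls)


addons = [
    WeChatProxyHandler()
]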
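The parsing step can be tried without a phone or a proxy by separating it into a pure function. The sketch below assumes the page embeds the list as `var msgList = '...';` with HTML-escaped JSON inside, which is what the slicing in the handler implies; `extract_article_urls` is a hypothetical name and the sample string in the usage note is made up.

```python
import html
import json
import re
from typing import List

# Captures the escaped JSON between the single quotes of the msgList assignment.
MSG_LIST_RE = re.compile(r"var msgList = '(.*?)';")


def extract_article_urls(page: str) -> List[str]:
    """Return the content_url of every entry in the page's msgList, if present."""
    match = MSG_LIST_RE.search(page)
    if not match:
        return []
    # html.unescape turns &quot; back into quotes; json.loads also decodes \/ to /.
    data = json.loads(html.unescape(match.group(1)))
    urls = []
    for item in data.get("list", []):
        url = item.get("app_msg_ext_info", {}).get("content_url")
        if url:
            urls.append(url)
    return urls
```

For example, a line such as `var msgList = '{&quot;list&quot;:[{&quot;app_msg_ext_info&quot;:{&quot;content_url&quot;:&quot;http:\/\/mp.weixin.qq.com\/s?__biz=abc&quot;}}]}';` yields a one-element list containing the unescaped URL.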

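Summary

A man-in-the-middle proxy can help with many tasks.

When rendering pages with Splash, it can intercept and drop slow requests.

With JS injection, it can drive automatic pagination while scraping.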
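As an illustration of the first point, a mitmproxy addon can drop requests that only slow down rendering before they reach the network. This is a minimal sketch under assumptions: the class name `BlockSlowRequests` and the suffix list are hypothetical, and which resources count as "slow" depends on the site being scraped.

```python
from urllib.parse import urlsplit


class BlockSlowRequests:
    """mitmproxy addon sketch: kill requests for heavy static resources."""

    # Assumed list of resource types the scraper does not need.
    BLOCKED_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".woff", ".woff2", ".mp4")

    def should_block(self, url: str) -> bool:
        # Compare only the path, so query strings do not hide the extension.
        path = urlsplit(url).path.lower()
        return path.endswith(self.BLOCKED_SUFFIXES)

    def request(self, flow):
        # mitmproxy calls this hook for every client request.
        if self.should_block(flow.request.url):
            flow.kill()


addons = [BlockSlowRequests()]
```

Saved as a script, this would be loaded the same way as the handler above, with `mitmdump -s` pointing at the file.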

This article was published by DurkBlue. Writing takes effort, so please credit the source when reposting.
