scraping tweets for fun and profit
tl;dr: find spontaneous interactions by scraping tweets with given keywords
cost: $0
build time: 5 minutes (MVP)
Twitter can be a great way to connect with like-minded people. But if you're here, you're (probably) not yet a Thought Leader™. Let's walk through a quick way to speed that up.
Imagine you're releasing a blog post soon. Or perhaps you're looking to connect with people joining an upcoming webinar. Or maybe you just want to grow your following in a niche.
With the below tool, you can find hundreds of relevant conversations to join¹ more or less instantly.²
table of contents:
- #1.1 - setup if you haven't used the Terminal before
- #1.2 - setup if you're familiar with the Terminal
- #2 - usage
- #3 - write to Google Sheets
- #4 - use cases
#1.1 - setup if you haven't used the Terminal before
This setup uses some code I've written and some existing open source software. You don't need to have experience working with code to use it.
Open the Terminal (CMD+Space, type Terminal, and it will pop up), and run the following commands:
The first will check you have Python installed:
python -V
# if not (if it displays an error), run
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python3
Next, we check if the Pip manager for Python libraries is installed:
pip list
# if not, run
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python get-pip.py
Cool, that's all there is to it. Now you have Python and the package manager Pip installed. To use open source Python code in the future, you can use pip install to add it to your computer.
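For example, to grab the popular requests HTTP library (purely an illustration; this script doesn't need it):
pip install requests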
#1.2 - setup if you're familiar with the Terminal
We use pip to install the Twint library (and google-auth + gspread if you want to write to Google Sheets):
pip install twint google-auth gspread
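To confirm the installs landed, you can ask pip to describe each package:
pip show twint google-auth gspread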
Finally, we fetch the code I've written:
git clone https://github.com/alecbw/Find-Tweets-By-Keyword.git && cd Find-Tweets-By-Keyword
#2 - usage
I generally recommend you use a VPN or proxy if you have one.
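If you don't have a VPN, twint itself can route requests through a proxy. A minimal sketch, assuming you're comfortable adding a few lines of Python (the Proxy_* attributes are twint's, not flags the wrapper script exposes, and the proxy address is a placeholder):

import twint

c = twint.Config()
c.Search = "ROAS"
c.Proxy_host = "127.0.0.1"  # your proxy's address - placeholder
c.Proxy_port = 9050
c.Proxy_type = "socks5"     # twint also accepts "http"
twint.run.Search(c)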
Running the script is as simple as writing the keywords you want to be in the tweets you scrape:
python3 get_tweets_by_keyword.py -k "ROAS" "retargeting"
You can add various options, like a different result CSV filename, and a per-keyword limit of tweets scraped:
python3 get_tweets_by_keyword.py -k "problem with magento" "shopify bug" "woocommerce" -o "Tweets about eCom.csv" -l 50
Once you've zeroed in on your use case, you can set up highly specific queries. In the following example, you could find speakers and high-profile attendees (e.g. only tweets with at least 2 likes by verified posters) at the upcoming BlackHat conference:
python3 get_tweets_by_keyword.py -o "Blackhat Conf.csv" -k "BlackHat conference" "virtual BlackHat" "BlackHat talk" "BlackHat speaker" --since "2020-05-01 00:00:00" --until "2020-10-01 12:00:00" -m 2 -v true -l 500
A series of optional command line arguments is provided (only -k/--keywords is required):
Option | Description |
---|---|
'-k' or '--keywords' | A list of keywords (separated by spaces) that you want to search for; required=True |
'-o' or '--output_filename' | Set the output filename to something other than the default |
'-g' or '--output_gsheet' | Write to Google Sheets with the spreadsheet name you specify |
'-d' or '--deduplicate' | Remove duplicates from the output (uses tweet_id) |
'-s' or '--since' | Filter by posted date since a given date. Format is 2019-12-20 20:30:15 |
'-u' or '--until' | Filter by posted date until a given date. Format is 2019-12-20 20:30:15 |
'-l' or '--limit' | Limit the results per keyword provided |
'-m' or '--min_likes' | Limit the results to tweets with at least a given number of likes |
'-n' or '--near' | Limit the results to tweets geolocated near a given city |
'-v' or '--verified' | Limit the results to tweets made by accounts that are verified |
'-q' or '--hide_output' | Disable routine results logging; default=True |
'-r' or '--resume' | Have the search resume at a specific Tweet ID |
A list of all the twint-supported args is at the bottom of get_tweets_by_keyword.py, as well.
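For the curious, the flags map roughly onto twint's Config object. Here's a minimal sketch of the BlackHat example expressed directly in twint (the attribute names are twint's; the exact wiring inside get_tweets_by_keyword.py may differ):

import twint

# roughly the BlackHat query above - a sketch, not the script's exact code
c = twint.Config()
c.Search = "BlackHat conference"  # one keyword per search; the script loops over your -k list
c.Limit = 500                     # -l
c.Since = "2020-05-01 00:00:00"   # -s / --since
c.Until = "2020-10-01 12:00:00"   # -u / --until
c.Min_likes = 2                   # -m
c.Verified = True                 # -v
c.Store_csv = True
c.Output = "Blackhat Conf.csv"    # -o
twint.run.Search(c)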
#3 - write to Google Sheets
If you've read my Google Sheets API walkthrough, you can use those credentials to have this script easily write to Google Sheets. If you haven't, go set up the auth, as described there.
The beauty of the Google Sheets write is that you can have one person responsible for running the script (or put it on a cronjob! See the sketch below) and have it write to a Sheet that is shared with others (e.g. a whole marketing team).
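If you go the cronjob route, a crontab -e entry could look like this (the schedule, path, and Sheet name are placeholders; note that cron doesn't read your shell profile, so the GSHEETS_* variables explained below must be set inline or at the top of the crontab):

# run every weekday at 9am; adjust the path to wherever you cloned the repo
0 9 * * 1-5 cd /path/to/Find-Tweets-By-Keyword && python3 get_tweets_by_keyword.py -k "ROAS" "retargeting" -g "Team Marketing Tweets"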
When setting up the Sheet:
- Don't forget: each write will overwrite the first tab.
- Remember, you need to share the Sheet with your gserviceaccount email.
- You'll need to export the GSheets keys to the local environment:
export GSHEETS_PRIVATE_KEY="-----BEGIN PRIVATE KEY-----superlongkeywithabunchofstuffinit"
export GSHEETS_CLIENT_EMAIL="theemailyousharedthesheetwith"
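If you're curious what that write looks like in code, here's a minimal sketch using gspread and google-auth with the same two environment variables (the script's internals may differ; the Sheet name and row values are placeholders):

import os
import gspread
from google.oauth2.service_account import Credentials

# build service account credentials from the exported env vars
creds = Credentials.from_service_account_info(
    {
        "type": "service_account",
        "private_key": os.environ["GSHEETS_PRIVATE_KEY"].replace("\\n", "\n"),
        "client_email": os.environ["GSHEETS_CLIENT_EMAIL"],
        "token_uri": "https://oauth2.googleapis.com/token",
    },
    scopes=[
        "https://www.googleapis.com/auth/spreadsheets",
        "https://www.googleapis.com/auth/drive",
    ],
)

gc = gspread.authorize(creds)
worksheet = gc.open("Team Marketing Tweets").sheet1  # the Sheet you shared with the service account

# overwrite the first tab, per the note above
worksheet.clear()
worksheet.update([["date", "username", "tweet"], ["2020-05-01", "someuser", "an example tweet"]])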
#4 - use cases
The core Twint library makes it easy to scrape lots of things from Twitter - a user's tweets, a user's followers, tweets that have emails in them, etc.
This code wraps Twint and does only one thing (and does it well): scrape tweets that contain any of a number of keywords.
If the former sounds more up your alley than the latter, I recommend you check out the Twint documentation.
A few use cases I like this tool for:
- getting reviews for your / your competitors' product
- surfacing relevant convos to promote your content
- finding coupons
- getting software recommendations
- building Twitter lists of folks to follow
- getting volume estimates (i.e. how many people are talking about [X,Y,Z]; see the snippet after this list)
- finding conference and webcast attendees
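For the volume-estimate case, a rough sketch: scrape with a generous limit, then count the rows in the CSV (this assumes the output file has a single header row):

python3 get_tweets_by_keyword.py -k "woocommerce" -o "volume.csv" -l 5000
echo $(( $(wc -l < "volume.csv") - 1 ))  # tweet count = line count minus the header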
¹ I strongly recommend against trying to automate responses to scraped tweets. It will come across as inauthentic (which it is) and it will hurt your brand. If you really want to scale this to the moon, hire a social media manager to run it.
² The script processes 1,500-2,000 tweets/minute.
Thanks for reading. Questions or comments? 👉🏻 alec@contextify.io