Since May 2012, when www.odesk.com/oconomy was taken down, leaving only a web.archive.org snapshot, there has been no direct source of information about the current state of the oDesk marketplace.
Indirect information about the number of contractors, average hourly rates across skills, and demand for those skills can now be drawn only from oDesk tests. This information is spread across the pages that describe individual tests, so it has to be gathered in one place before it can be analyzed. Scraping it by hand is tedious, even for a pythonista armed with requests and html5lib.
Extract tests data with Scrapy
Scrapy is a framework intended to ease the implementation of web spiders. Extraction of the oDesk tests fits into three modules:
- items.py contains a declarative description of the data being extracted; it closely resembles models in Django
- tests_spider.py extends Scrapy's spider implementation; it:
  - defines regex rules for extracting the URLs of pages to be fetched and maps them to handlers
  - implements the handlers, which extract data using XPath and assign it to objects from items.py
- settings.py contains project settings, e.g. the average DOWNLOAD_DELAY between pages, the number of CONCURRENT_REQUESTS_PER_DOMAIN, etc.
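To illustrate the settings mentioned in the list, a throttled configuration might look like the sketch below. The bot name is taken from the crawl log shown later in this post; the numeric values are assumptions chosen for politeness, not the values actually used for the crawl.

```python
# settings.py (sketch): the numeric values are illustrative assumptions.

BOT_NAME = "otests"  # matches "bot: otests" in the crawl log

# Average pause, in seconds, between requests to the same site.
DOWNLOAD_DELAY = 2.0

# When enabled (the Scrapy default), the actual pause is randomized
# between 0.5x and 1.5x DOWNLOAD_DELAY.
RANDOMIZE_DOWNLOAD_DELAY = True

# Upper bound on simultaneous requests to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
```

A low concurrency limit combined with a delay keeps the spider slow but gentle on the site being crawled.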
An example of running the tests spider, extracting the data, and saving it to a CSV file is shown below:
    $ scrapy crawl -o tests_apr7.csv -t csv tests
    2013-04-07 23:48:13+0400 [scrapy] INFO: Scrapy 0.16.4 started (bot: otests)
    2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2013-04-07 23:48:13+0400 [scrapy] DEBUG: Enabled item pipelines:
    2013-04-07 23:48:13+0400 [tests] INFO: Spider opened
    2013-04-07 23:48:13+0400 [tests] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2013-04-07 23:48:13+0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2013-04-07 23:48:13+0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    ...
    2013-04-08 00:20:47+0400 [tests] INFO: Closing spider (finished)
    2013-04-08 00:20:47+0400 [tests] INFO: Stored csv feed (440 items) in: tests_apr7.csv
    2013-04-08 00:20:47+0400 [tests] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 259496,
         'downloader/request_count': 888,
         'downloader/request_method_count/GET': 888,
         'downloader/response_bytes': 1672291,
         'downloader/response_count': 888,
         'downloader/response_status_count/200': 446,
         'downloader/response_status_count/301': 1,
         'downloader/response_status_count/302': 441,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 4, 7, 20, 20, 47, 939438),
         'item_scraped_count': 440,
         'log_count/DEBUG': 1334,
         'log_count/INFO': 37,
         'request_depth_max': 3,
         'response_received_count': 446,
         'scheduler/dequeued': 888,
         'scheduler/dequeued/memory': 888,
         'scheduler/enqueued': 888,
         'scheduler/enqueued/memory': 888,
         'start_time': datetime.datetime(2013, 4, 7, 19, 48, 13, 512624)}
    2013-04-08 00:20:47+0400 [tests] INFO: Spider closed (finished)
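The stats dump also says how long the crawl took and how fast it went. The short calculation below is only a sketch that reuses four numbers copied from the log (start_time, finish_time, request count, and item count); the variable names are mine.

```python
from datetime import datetime

# Values copied from the "Dumping Scrapy stats" part of the log.
start_time = datetime(2013, 4, 7, 19, 48, 13, 512624)
finish_time = datetime(2013, 4, 7, 20, 20, 47, 939438)
request_count = 888
item_scraped_count = 440

# Total crawl duration in seconds.
elapsed = (finish_time - start_time).total_seconds()

# Average pacing of the crawl.
seconds_per_request = elapsed / request_count
items_per_minute = item_scraped_count / (elapsed / 60)

print(f"elapsed: {elapsed:.0f} s")
print(f"seconds per request: {seconds_per_request:.2f}")
print(f"items per minute: {items_per_minute:.1f}")
```

The crawl ran for a little over half an hour, at roughly one request every two seconds.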
The excerpt shows that there are just 440 tests on oDesk. Let's dig deeper.
Analyzing data with Pandas
Pandas is a large data-analysis library built on NumPy. If NumPy is called "Matlab in Python", then Pandas is "the R language in Python".
Let's start an interactive Python interpreter and load the data from the CSV file:
    >>> import pandas as pd
    >>> import numpy as np
    >>> # Convert val with typ, falling back to default on bad input.
    >>> def default_value(typ, default, val):
    ...     try:
    ...         return typ(val)
    ...     except ValueError:
    ...         return default
    ...
    >>> def maybe_int(val):
    ...     return default_value(np.int64, None, val.replace(',', ''))
    ...
    >>> def maybe_float(val):
    ...     return default_value(np.float64, None, val)
    ...
    >>> tests = pd.read_csv('tests_apr7.csv', thousands=',', converters={
    ...     'hourly_rate_max': maybe_float,
    ...     'hourly_rate_avg': maybe_float,
    ...     'percent_independent': maybe_float,
    ...     'average_qualifications': maybe_float,
    ...     'taken_test': maybe_int,
    ...     'passed_test': maybe_int,
    ...     'tests_taken': maybe_int,
    ... })
    >>> tests
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 440 entries, 0 to 439
    Data columns:
    hourly_rate_max           432  non-null values
    hourly_rate_avg           432  non-null values
    percent_independent       440  non-null values
    title                     440  non-null values
    average_qualifications    440  non-null values
    taken_test                440  non-null values
    average_hours             435  non-null values
    passed_test               440  non-null values
    test_id                   440  non-null values
    tests_taken               440  non-null values
    dtypes: float64(5), int64(4), object(1)
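The converters passed to read_csv all follow one pattern: try a conversion, and fall back to a default when the cell is not a valid number. The sketch below shows that pattern on its own, with the built-in int and float swapped in for the NumPy types so it runs without any dependencies. The sample strings are made up for illustration.

```python
def default_value(typ, default, val):
    """Convert val with typ, returning default when conversion fails."""
    try:
        return typ(val)
    except ValueError:
        return default


def maybe_int(val):
    # Strip thousands separators before converting, e.g. '743,081'.
    return default_value(int, None, val.replace(",", ""))


def maybe_float(val):
    return default_value(float, None, val)


# Hypothetical cell values, as a CSV reader would hand them over.
print(maybe_int("743,081"))  # 743081
print(maybe_int("n/a"))      # None
print(maybe_float("8.73"))   # 8.73
print(maybe_float(""))       # None
```

Returning None instead of raising lets a single malformed cell turn into a missing value rather than abort the whole load.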
Now some interesting statistics can be derived.
10 tests with most contractors
Guess which test is the most popular.
    >>> tests.sort_index(
    ...     by=['passed_test'], ascending=False
    ... ).ix[
    ...     :, ['test_id', 'title', 'passed_test']
    ... ][:10]
         test_id  title                                              passed_test
    0        752  oDesk Readiness Test for Independent Contracto...       743081
    439      511  U.S. English Basic Skills Test                          345213
    438      688  English Spelling Test (U.S. Version)                    269360
    435      545  Office Skills Test                                      114577
    436      584  Windows XP Test                                         104314
    434      693  English Vocabulary Test (U.S. Version)                   88943
    437      753  oDesk Readiness Test for Agency Contractors              84282
    433      506  Email Etiquette Certification                            60019
    429      571  Telephone Etiquette Certification                        48861
    428      484  Call Center Skills Test                                  44063
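The sort_index(by=...) and .ix calls reflect the pandas API as it was in 2013; both were removed in later releases. With a recent pandas, the same kind of query could be written with sort_values and .loc, or with nlargest. The sketch below assumes a recent pandas and uses a four-row frame copied from the table above as a stand-in for the full CSV.

```python
import pandas as pd

# A few rows copied from the table above, deliberately out of order.
tests = pd.DataFrame({
    "test_id": [584, 752, 688, 511],
    "title": [
        "Windows XP Test",
        "oDesk Readiness Test for Independent Contractors",
        "English Spelling Test (U.S. Version)",
        "U.S. English Basic Skills Test",
    ],
    "passed_test": [104314, 743081, 269360, 345213],
})

cols = ["test_id", "title", "passed_test"]

# Sort by the number of contractors who passed, keep the top three rows.
top = tests.sort_values("passed_test", ascending=False).loc[:, cols].head(3)

# nlargest expresses the same "top N by column" query more directly.
same = tests.nlargest(3, "passed_test")[cols]

print(top)
```

On the full dataset, replacing 3 with 10 would give the equivalent of the query above.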
See also the oDesk Knowledgebase article "What is the oDesk Readiness Test?"
10 tests with highest average hourly rates
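The table below suggests an interesting relationship between the number of contractors and the average hourly rate: the tests with the highest average rates were passed by comparatively few contractors.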
    >>> tests.sort_index(
    ...     by=['hourly_rate_avg'], ascending=False
    ... ).ix[
    ...     :, ['title', 'hourly_rate_avg', 'passed_test']
    ... ][:10]
         title                                              hourly_rate_avg  passed_test
    14   VB.NET Programming Skills Test (Hands-on progr...            49.50            5
    253  Adobe FrameMaker 8 Test                                      47.75           36
    131  Design Considerations for Mobile Web Applicati...            36.19           58
    166  VLSI Test                                                    34.00           48
    143  Checkpoint Security Test                                     29.00           68
    248  RDF Test                                                     28.50           26
    240  Knowledge of ColdFusion 9 Skills Test                        28.49           50
    29   PostgreSQL Test                                              28.10          199
    266  Web Services Test                                            27.95          301
    53   Cocoa programming for Mac OS X 10.5 Test                     27.55          567
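One rough way to put a number on that relationship is a Pearson correlation coefficient between the two columns. The sketch below computes it by hand for the ten rows shown above only. That is a small sample, already selected for high rates, so the result is an illustration and says little about the full dataset. With the whole frame loaded, tests['hourly_rate_avg'].corr(tests['passed_test']) would be the natural call.

```python
import math

# (hourly_rate_avg, passed_test) pairs copied from the table above.
rows = [
    (49.50, 5), (47.75, 36), (36.19, 58), (34.00, 48), (29.00, 68),
    (28.50, 26), (28.49, 50), (28.10, 199), (27.95, 301), (27.55, 567),
]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


r = pearson(rows)
print(f"Pearson r = {r:.2f}")
```

A negative coefficient here matches the pattern visible in the table: among these ten tests, the higher the average rate, the fewer contractors passed.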
10 tests with most worked hours
These figures can be used to estimate a lower bound on the total hours worked and the money earned up to Apr 8, 2013.
    >>> tests['total_hours'] = tests['passed_test'] * tests['average_hours']
    >>> tests['total_earnings'] = tests['total_hours'] * tests['hourly_rate_avg']
    >>> tests[tests['total_hours'] > 0].sort_index(
    ...     by=['total_hours'], ascending=False
    ... ).ix[
    ...     :, ['title', 'total_hours', 'total_earnings']
    ... ][:10]
         title                                              total_hours  total_earnings
    0    oDesk Readiness Test for Independent Contracto...    308378615    2.692145e+09
    439  U.S. English Basic Skills Test                       196080984    1.815710e+09
    438  English Spelling Test (U.S. Version)                 144646320    9.922738e+08
    435  Office Skills Test                                    70808586    4.935358e+08
    436  Windows XP Test                                       61023690    5.339573e+08
    434  English Vocabulary Test (U.S. Version)                52120598    3.841288e+08
    437  oDesk Readiness Test for Agency Contractors           49642098    2.765065e+08
    433  Email Etiquette Certification                         44474079    3.740270e+08
    429  Telephone Etiquette Certification                     36499167    2.901684e+08
    428  Call Center Skills Test                               33884447    2.232985e+08
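To make the two derived columns concrete, the first row can be worked backwards. The sketch below takes passed_test from the "most contractors" table and the totals from the table above, then recovers the per-contractor inputs. Those inputs (average_hours and hourly_rate_avg for this test) are not printed anywhere in the post, so treat them as inferred values.

```python
# oDesk Readiness Test for Independent Contractors (row 0).
passed_test = 743081            # from the "most contractors" table
total_hours = 308378615         # from the table above
total_earnings = 2.692145e+09   # from the table above, as rounded by pandas

# total_hours = passed_test * average_hours
average_hours = total_hours / passed_test

# total_earnings = total_hours * hourly_rate_avg
hourly_rate_avg = total_earnings / total_hours

print(average_hours)              # 415.0
print(round(hourly_rate_avg, 2))  # 8.73
```

So the top row corresponds to about 415 hours per contractor at an average of roughly 8.73 per hour.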