Let’s not mince words: you use GitHub to find code to reuse in your own projects. And while GitHub is the largest and best repository for code you can pinch for your own apps and services, it is sometimes terrible at surfacing relevant results. Welcome to the CodeSearchNet challenge!

“Search engines for code are often frustrating and never fully understand what we want,” GitHub admits. The CodeSearchNet challenge hopes to eliminate (or drastically reduce) that frustration. To help things along, GitHub released a large portion of its own dataset to assist data scientists and machine-learning engineers in building models. The challenge is open to anyone.

GitHub also released its data processing pipeline to challenge participants, which helps level the playing field. And there is a playing field: GitHub has a leaderboard tracking normalized discounted cumulative gain (NDCG) scores overall, as well as per-language scores for Go, Java, Python, JavaScript, PHP, and Ruby.
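The cumulative-gain family of metrics rewards rankings that put the most relevant results near the top. Here is a minimal sketch of normalized discounted cumulative gain (NDCG), a standard member of that family; the relevance judgments below are hypothetical, not taken from the challenge data:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each result's relevance is discounted
    # by log2 of its (1-indexed) rank, so top-ranked hits count more.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the ideal DCG (the same results in best-first order),
    # giving a score in [0, 1] where 1.0 means a perfect ranking.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of one query's top-5 results, in ranked order.
print(round(ndcg([3, 0, 2, 1, 0]), 3))
```

A perfectly ordered result list scores exactly 1.0; burying a highly relevant result lower in the list drags the score down, which is why the metric suits a search-quality leaderboard.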

The challenge also aims to utilize natural language search, prioritizing ‘normal’ search terms over rigid, formulaic queries. It would probably be easier for GitHub to ‘fix’ its search by funneling users into some dogmatic query form (something like ‘language, version, issue to solve’ might work), but the ultimate aim is to extend its search beyond code. “We anticipate other use cases for this dataset beyond code search and are presenting code search as one possible task that leverages learned representations of natural language and code,” GitHub notes in a blog post.
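The quoted idea, “learned representations of natural language and code,” boils down to mapping a plain-English query and each code snippet into a comparable vector space and ranking by similarity. Here is a toy sketch of that retrieval loop, substituting bag-of-words vectors for learned embeddings; the corpus and snippet names are invented for illustration:

```python
import math
import re
from collections import Counter

def tokens(text):
    # Split identifiers (camelCase, snake_case) and words into lowercase tokens,
    # so the query "sort a list" can match a function named sortList.
    out = []
    for word in re.findall(r"[A-Za-z]+", text):
        out += re.findall(r"[a-z]+|[A-Z][a-z]*", word)
    return [t.lower() for t in out]

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, snippets):
    # Rank snippets by similarity to the natural-language query.
    q = Counter(tokens(query))
    return sorted(snippets, key=lambda s: cosine(q, Counter(tokens(s))), reverse=True)

corpus = [
    "def parse_json(path): return json.load(open(path))",
    "def sortList(items): return sorted(items)",
    "def fetch_url(url): return urllib.request.urlopen(url).read()",
]
print(search("sort a list of items", corpus)[0])
```

A real entry would replace the token-count vectors with embeddings learned from the challenge's code/documentation pairs, but the ranking machinery around them looks much the same.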

This is possibly the biggest improvement GitHub could make, and it’s the latest in a flurry of updates to the platform since its acquisition by Microsoft. Most recently, the company expanded its Learning Lab to cover more subject matter, and its token-scanning feature added support for outside providers (such as Dropbox).

If that wasn't enough, GitHub expanded Actions to cover CI/CD. Sponsors allows for monetization, and Package Registry makes managing packages simpler. There’s also the Spectrum acquisition, which could challenge Stack Overflow in a meaningful way once a code-search feature is pounded into shape.

Those are all wonderful features, but finding relevant repos is still paramount. As GitHub admits, “there is no standard dataset suitable for code search evaluation,” so we hope this challenge creates such a thing.