Good question. It is for training and evaluating models that fix Python code after a dependency upgrade breaks it. If you have ever bumped numpy or pandas and watched old code stop working, this is the data to teach a model to make those fixes automatically. Each example pairs the broken code with the fix and a short note on the API change that caused it. There is a live demo in the collection if you want to see it fix real code.
Abhisek Behera PRO
Abhisek987
Recent Activity
replied to their post 2 days ago
Every Python developer has hit this: you upgrade numpy or pandas, and code that worked yesterday breaks today.
I built an open dataset for exactly this problem. DepDoctor is 6,204 examples of Python code broken by a dependency upgrade, each paired with the fix and a short note on the API change that caused it. It is a mixture of real cases mined from public GitHub commits and synthetic cases generated from a database of known breaking changes.
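To make the shape of an example concrete, here is a minimal sketch of what one record might look like. The field names (`library`, `broken_code`, `fixed_code`, `note`) and the helper `needs_change` are assumptions made for illustration, not the dataset's actual schema. The second record assumes a "leave it alone" example is one where the correct output is the input unchanged. The NumPy change itself is real: the `np.float` alias was deprecated in NumPy 1.20 and removed in 1.24.

```python
# Hypothetical record layout; the real DepDoctor column names may differ.
example = {
    "library": "numpy",
    # Fails on NumPy >= 1.24 with AttributeError: the np.float alias is gone.
    "broken_code": "import numpy as np\nx = np.zeros(3, dtype=np.float)\n",
    "fixed_code": "import numpy as np\nx = np.zeros(3, dtype=float)\n",
    "note": "np.float was deprecated in NumPy 1.20 and removed in 1.24; "
            "use the builtin float instead.",
}

# Assumed shape of a "leave it alone" record: the code already works after
# the upgrade, so the expected fix is identical to the input.
no_change_example = {
    "library": "numpy",
    "broken_code": "import numpy as np\nx = np.zeros(3, dtype=np.float64)\n",
    "fixed_code": "import numpy as np\nx = np.zeros(3, dtype=np.float64)\n",
    "note": "np.float64 is unaffected by the alias removal; no change needed.",
}


def needs_change(record: dict) -> bool:
    """Return True when the expected fix differs from the input code."""
    return record["broken_code"] != record["fixed_code"]
```

Under these assumptions, `needs_change` separates the two kinds of record, which is the distinction a model has to learn in order to show restraint.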
A few things I tried to get right:
- 935 "leave it alone" examples, to teach a model restraint, not just what to change.
- Honest evaluation: a fine-tuned Qwen2.5-Coder-7B gets 62% of fixes fully correct. I report that, not just the 97% text-similarity score that hides the truth.
- The main failure mode, over-editing, is measured and explained rather than buried.
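The gap between a text-similarity score and full correctness can be shown with a small sketch. The post does not say which similarity metric or which over-editing measure was used, so everything here is a stand-in: `difflib.SequenceMatcher` for similarity, whitespace-trimmed string equality for "fully correct", and a changed-line count as a rough over-editing proxy. The sample strings are made up for illustration. They use a real breaking change: `DataFrame.append` was removed in pandas 2.0 in favor of `pd.concat`.

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in metric)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def exact_match(prediction: str, reference: str) -> bool:
    """Strict correctness: the prediction must equal the reference fix."""
    return prediction.strip() == reference.strip()


def changed_lines(before: str, after: str) -> int:
    """Count removed plus added lines between two versions."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def over_edited(broken: str, prediction: str, reference: str) -> bool:
    """Assumed proxy: the model touched more lines than the reference fix."""
    return changed_lines(broken, prediction) > changed_lines(broken, reference)


# Illustrative strings only; not taken from the dataset.
broken = (
    "import pandas as pd\n"
    "df = pd.DataFrame({'a': [1]})\n"
    "row = pd.DataFrame({'a': [2]})\n"
    "df = df.append(row)\n"
    "print(df)\n"
)
reference = broken.replace("df.append(row)", "pd.concat([df, row])")
# A prediction that makes the needed fix but also changes unrelated behavior.
prediction = (
    broken.replace("df.append(row)", "pd.concat([df, row], ignore_index=True)")
    .replace("print(df)", "print(df.head())")
)

print(f"similarity:  {similarity(prediction, reference):.2f}")
print(f"exact match: {exact_match(prediction, reference)}")
print(f"over-edited: {over_edited(broken, prediction, reference)}")
```

The prediction scores high on similarity because most characters match the reference, yet it is not an exact match and it edits a line the reference leaves alone. A stricter check would run both versions against tests rather than compare strings.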
Dataset, fine-tuned model, and a live demo are all open in one place:
https://huggingface.co/collections/Abhisek987/depdoctor
Feedback welcome, especially from anyone working on code repair or API migration.
posted an update 3 days ago
updated a dataset 3 days ago
Abhisek987/depdoctor-dataset