Title: Senior Software AI Engineer/Architect
Location: Dallas, TX (onsite)
Duration: Contract
Job Description
This role requires deep, end-to-end understanding of how Large Language Models are built, trained, optimized, deployed, and operated.
Candidates must demonstrate hands-on experience beyond consuming hosted LLM APIs, with a strong grasp of the underlying ML theory, system trade-offs, and production realities of AI/ML solutions.
Mandatory Competency Areas (Non-Negotiable)
1. Foundations of LLMs (How They Actually Work)
Candidate must demonstrate first-principles understanding, including:
- Transformer architectures (attention, embeddings, positional encoding)
- Tokenization strategies and their impact on cost and performance
- Training vs inference behavior
- Loss functions, pre-training objectives, and alignment techniques (SFT, RLHF)
- Limitations: hallucinations, bias, context collapse, long-range degradation
2. Model Development & Adaptation
Hands-on experience with:
- Pre-training vs fine-tuning trade-offs
- Parameter-efficient tuning (LoRA, QLoRA, adapters)
- Quantization and pruning techniques
- Model evaluation beyond accuracy (task fitness, safety, robustness)
- Data curation, labeling strategies, and contamination risks
3. Inference, Serving & Optimization
Strong understanding of:
- Inference pipelines and token generation mechanics
- KV caching, batching, streaming responses
- Throughput vs latency trade-offs
- Memory constraints and GPU utilization strategies
- Model parallelism strategies (tensor, pipeline) and their failure modes
4. End-to-End AI/ML System Design
Ability to architect complete AI solutions, including:
- Data ingestion and preprocessing pipelines
- Training / fine-tuning workflows
- Model registry, versioning, and lineage
- Deployment strategies (canary, A/B, shadow traffic)
- Feedback loops for continuous improvement
5. Retrieval, Memory & Tool-Augmented Systems
In-depth experience with:
- Retrieval-Augmented Generation (RAG) design
- Embeddings lifecycle management
- Vector databases and hybrid retrieval
- Prompt/tool orchestration and agentic workflows
- Failure modes of RAG and mitigation strategies
6. MLOps, Observability & Reliability
Strong ownership mindset for production AI:
- Monitoring model quality drift and regressions
- Debugging hallucinations and retrieval failures
- Logging prompts, responses, and model metadata
- Cost tracking and optimization (token economics)
- Incident response for AI systems
7. Security, Ethics & Governance
Clear understanding of:
- Prompt injection and data leakage risks
- Training data privacy and IP protection
- Model abuse, misuse, and guardrails
- Regulatory and compliance considerations
- Responsible AI principles in production systems