Despite rapid advancements, Large Language Models (LLMs) often produce hallucinated or detrimental outputs, necessitating alignment with human preferences. We address these challenges by introducing Step Chain-of-Thought (SCoT), which enhances semantic understanding by breaking down complex instructions. Additionally, we combine Direct Preference Optimization (DPO) with Low-Rank Adaptation (LoRA) to improve alignment with user intent: DPO optimizes outputs based on human feedback, while LoRA, alongside careful tuning of learning rates and beta values, mitigates the repetition issues observed with DPO alone. Our findings show that models fine-tuned with DPO combined with LoRA achieve superior alignment compared to those using only Supervised Fine-Tuning (SFT). However, automated evaluators such as LLM-as-a-Judge struggle with nuanced SCoT assessments, underscoring the necessity of human evaluation for capturing the complexities of alignment. In task alignment for robotics, Full Fine-Tuning (FFT) excels on familiar tasks, while LoRA significantly improves adaptability to new scenarios and increases robustness. Moreover, combining ground-truth with synthetic data, especially when using LoRA, strikes a balance between accuracy and adaptability, revealing the limitations of relying solely on synthetic data. These conclusions highlight the critical importance of well-aligned datasets, fine-tuning strategies, and careful parameter tuning for LLM alignment.
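
To make the DPO-with-LoRA setup concrete, the following is a minimal sketch using the Hugging Face TRL and PEFT libraries. The base model, preference dataset, LoRA rank, beta, and learning rate shown here are illustrative assumptions, not the configuration used in this work.

```python
# Minimal sketch of DPO fine-tuning with LoRA adapters (TRL + PEFT).
# All model/dataset names and hyperparameters below are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# LoRA: only low-rank adapter matrices are trained; the base weights stay frozen.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# DPO hyperparameters; beta and the learning rate are the knobs highlighted
# above for mitigating repetitive outputs.
training_args = DPOConfig(
    output_dir="dpo-lora-out",
    beta=0.1,
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# Any preference dataset with chosen/rejected response pairs works;
# this particular dataset is only an example.
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions pass tokenizer= instead
    peft_config=peft_config,     # with LoRA, TRL derives the reference model implicitly
)
trainer.train()
```

In a setup like this, only the adapter weights are updated during preference optimization, which keeps sweeps over beta and the learning rate comparatively cheap relative to full fine-tuning.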