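I want to replace the accession numbers "A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023" in the following for loop with a new list. My new list, however, is a .CSV file containing hundreds of accession numbers. My question is: can I read the .CSV file directly and have it act as the list in the for loop?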
for ACC in A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023
do
echo -n -e "$ACC\t"
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ACC}&rettype=fasta&retmode=xml" |\
grep TSeq_taxid |\
cut -d '>' -f 2 |\
cut -d '<' -f 1 |\
tr -d "\n"
echo
done
The .csv file looks like this:
WP_004064712.1
WP_023555236.1
WP_051593235.1
KAJ52037.1
WP_012103448.1
WP_049740904.1
WP_003346264.1
WP_026134014.1
WP_051870539.1
AKF93952.1
XP_008397367.1
XP_014896959.1
XP_007567109.1
XP_014847432.1
EHG27035.1
EGX75147.1
WP_033630878.1
Answer 1
As @Mark notes, if your CSV file contains one value per line, you can do this easily by replacing the initial list with a command substitution.
for ACC in `cat csvfile`
do
...
done
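As a minimal sketch of what the command substitution does, the snippet below writes three of the accession numbers from the question into a temporary file and loops over them. The temporary file and the `list_accessions` function name are assumptions made for illustration. The shell splits the file's contents on whitespace, so each line becomes one loop item; this only behaves as intended when no value contains spaces or glob characters.

```shell
# Hypothetical sample file holding one accession per line.
csvfile=$(mktemp)
trap 'rm -f "$csvfile"' EXIT
printf '%s\n' WP_004064712.1 WP_023555236.1 KAJ52037.1 > "$csvfile"

# Loop over the whitespace-separated words of the file named in $1.
list_accessions() {
    for acc in $(cat "$1"); do
        printf 'accession: %s\n' "$acc"
    done
}

list_accessions "$csvfile"
```

The `$(...)` form is equivalent to the backticks used above and is easier to nest.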
Answer 2
If you know which values you want to replace "A00002 X53307 BB145968 CAA42669 V00181 AH002406 HQ844023" with, you can do the following:
# CSV holds the file name, because sed -i edits a file in place
CSV=csvfile
sed -i "s/A00002/NewValue/g" "$CSV"
sed -i "s/X53307/NewValue/g" "$CSV"
...
Explanation of the sed command:
sed -i "s/X53307/NewValue/g" "$CSV"
This command replaces X53307 with NewValue directly in the file $CSV.
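Before editing a file in place, it can help to preview the substitution. The sketch below runs the same `s/X53307/NewValue/g` expression without `-i`, so sed writes the result to standard output and leaves the file unchanged. The temporary file and its three sample lines are assumptions made for illustration.

```shell
# Hypothetical sample file containing some of the IDs from the question.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' A00002 X53307 BB145968 > "$sample"

# Without -i, sed prints the edited text and does not modify the file.
preview=$(sed 's/X53307/NewValue/g' "$sample")
printf '%s\n' "$preview"
```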
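Answer 3
You are forgetting two things here:
- The string expansion in the curl statement is what produces the output.
- As @John suggests, the CSV file can be used to drive the input.
So there is no need to substitute the string values; you only need to override them.
Old: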
<?xml version="1.0"?>
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
<TSeq_seqtype value="nucleotide"/>
<TSeq_gi>39899</TSeq_gi>
<TSeq_accver>X53307.1</TSeq_accver>
<TSeq_taxid>1423</TSeq_taxid>
<TSeq_orgname>Bacillus subtilis</TSeq_orgname>
<TSeq_defline>Bacillus subtilis epr gene for a novel serine protease</TSeq_defline>
<TSeq_length>2521</TSeq_length>
<TSeq_sequence>GTTAACAGGATATCCGAGCTTATCGGCCCACTCGTTCCCAAACACACTCGCCATGAAATCAGCATACCCCGGAATCGGCAAGCTCGTTAAAATCAAGAAGACAGACCCGATAATAATCAGCGGCATGGACTGGATAATTCCGTCACGCAAAGCGCTGAGATGCCGCTGCCCGGCAATTTTCCCGGCGACAGGCATTATTTTTTCCTCCATCACCCGAGTGAATGTGCTCATCTTAAAAACCCCCTTTTCTCATTGCTTTGTGAACAACAACCTCCGCAATGTTTTCTTTATCTTATTTTGAAAACGCTTAGAAATTCATTTGGAAAATTTCCTCTTCATGCGGAAAAAATCTGCATTTTGCTAAACAACCCTGCCCATGAAAATTTTTTCCTTCTTACTATTAATCTCTCTTTTTTTCTCCGATATATATATCAAACATCATAGAAAAAGGAGATGAATCATGAAAAACATGTCTTGCAAACTTGTTGTATCAGTCACTCTGTTTTTCAGTTTTCTCACCATAGGCCCTCTCGCTCATGCGCAAAACAGCAGCGAGAAAGAGGTTATTGTGGTTTATAAAAACAAGGCCGGAAAGGAAACCATCCTGGACAGTGATGCTGATGTTGAACAGCAGTATAAGCATCTTCCCGCGGTAGCGGTCACAGCAGACCAGGAGACAGTAAAAGAATTAAAGCAGGATCCTGATATTTTGTATGTAGAAAACAACGTATCATTTACCGCAGCAGACAGCACGGATTTCAAAGTGCTGTCAGACGGCACTGACACCTCTGACAACTTTGAGCAATGGAACCTTGAGCCCATTCAGGTGAAACAGGCTTGGAAGGCAGGACTGACAGGAAAAAATATCAAAATTGCCGTCATTGACAGCGGGATCTCCCCCCACGATGACCTGTCGATTGCCGGCGGGTATTCAGCTGTCAGTTATACCTCTTCTTACAAAGATGATAACGGCCACGGAACACATGTCGCAGGGATTATCGGAGCCA
AGCATAACGGCTACGGAATTGACGGCATCGCACCGGAAGCACAAATATACGCGGTTAAAGCGCTTGATCAGAACGGCTCGGGGGATCTTCAAAGTCTTCTCCAAGGAATTGACTGGTCGATCGCAAACAGGATGGACATCGTCAATATGAGCCTTGGCACGACGTCAGACAGCAAAATCCTTCATGACGCCGTGAACAAAGCATATGAACAAGGTGTTCTGCTTGTTGCCGCAAGCGGTAACGACGGAAACGGCAAGCCAGTGAATTATCCGGCGGCATACAGCAGTGTCGTTGCGGTTTCAGCAACAAACGAAAAGAATCAGCTTGCCTCCTTTTCAACAACTGGAGATGAAGTTGAATTTTCAGCACCGGGGACAAACATCACAAGCACTTACTTAAACCAGTATTATGCAACGGGAAGCGGAACATCCCAAGCGACACCGCACGCCGCTGCCATGTTTGCCTTGTTAAAACAGCGTGATCCTGCCGAGACAAACGTCCAGCTTCGCGAGGAAATGCGGAAAAACATCGTTGATCTTGGTACCGCAGGCCGCGATCAGCAATTTGGCTACGGCTTAATCCAGTATAAAGCACAGGCAACAGATTCAGCGTACGCGGCAGCAGAGCAAGCGGTGAAAAAAGCGGAACAAACAAAAGCACAAATCGATATCAACAAAGCGCGAGAACTCATCAGCCAGCTGCCGAACTCCGACGCCAAAACTGCCCTGCACAAAAGACTGGATAAAGTACAGTCATACAGAAATGTAAAAGATGCGAAAGACAAAGTCGCAAAGGCAGAAAAATATAAAACACAGCAAACCGTTGACACAGCACAAACTGCCATCAACAAGCTGCCAAACGGAACAGACAAAAAGAACCTTCAAAAACGCTTAGACCAAGTAAAACGATACATCGCGTCAAAGCAAGCGAAAGACAAAGTTGCGAAAGCGGAAAAAAGCAAAAAGAAAACAGATGTGGACAGCGCACAATCAGCAATTGGCAAGCTGCCTGCAAGTTCAGAAAA
AACGTCCCTGCAGAAACGCCTTAACAAAGTGAAGAGCACCAATTTGAAGACGGCACAGCAATCCGTATCTGCGGCTGAAAAGAAATCAACTGATGCAAATGCGGCAAAAGCACAATCAGCCGTCAATCAGCTTCAAGCAGGCAAGGACAAAACGGCATTGCAAAAACGGTTAGACAAAGTGAAGAAAAAGGTGGCGGCGGCTGAAGCAAAAAAAGTGGAAACTGCAAAGGCAAAAGTGAAGAAAGCGGAAAAAGACAAAACAAAGAAATCAAAGACATCCGCTCAGTCTGCAGTGAATCAATTAAAAGCATCCAATGAAAAAACAAAGCTGCAAAAACGGCTGAACGCCGTCAAACCGAAAAAGTAACCAAAAACCTTTAAGATTTGCATTCCAAGTCTTAAAGGTTTTTTTCATTCTAAGAACACCACACACAACCTTTTTCCCATCCATTGTACAGGCTTTTCATACTATTGCTATACAGCCATGAAC</TSeq_sequence>
</TSeq>
</TSeqSet>
New:
<?xml version="1.0"?>
<!DOCTYPE TSeqSet PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
<TSeq_seqtype value="protein"/>
<TSeq_gi>490166065</TSeq_gi>
<TSeq_accver>WP_004064712.1</TSeq_accver>
<TSeq_taxid>97253</TSeq_taxid>
<TSeq_orgname>Eubacterium plexicaudatum</TSeq_orgname>
<TSeq_defline>hypothetical protein [Eubacterium plexicaudatum]</TSeq_defline>
<TSeq_length>1508</TSeq_length>
<TSeq_sequence>MKKSFMTRVLAVSLSAAMAFSMSSASNLVTASAASTVNLKTTFKTLKVGQTYKLTLKKNTLNWKITKVQTTNKKICTVYGKTASSVMLKGKGVGRAKISVKVKTTKRKYPKNIKIMKCTANVKAADGSGTTDEFKVTSATASSNTEVRVMFSKAIDAAEMTNFTVSDSVTVSKAELSEDKKSVLLTIAGAEYGKNYELTVNGIKVAGKEQAAQKVTFTTPSASEKYPTTLEAKDPVLASDGHSQTLVTFTIKDANGNPITDKGVEVAFATSLGKFAEQRVSIQNGVATVMYTSEALMETQTSAITATVVESTDNQELMGLSATSSITLTPNPDEFNIVPIITSITAPTADRVIAYFNEKVSASDFKTASGKLDHSKFTANVAWGFDNGFDELGNRLVGRSNVVGILDVPGSDNALQLLVDRPMTDNTNISVTFENKTKASSLVSASNTVYTKLTDAHQPSVLTAKGDGLRTVVVNFSEAVLPTAYCDNVETDKKNANQTLFAADNIENYLIDGKPLSYWGVTEVKTPDSETPDDTSSNLKKESSKNDATKTGSEKPGEIQVGSYKDGEDNRHVVTIKLSRERFLEPGTHSMTISNVGDWAAKTDRERNIVNTQTFDFVVENNDVIPTFEVEEQSPEQWLLKFNSDIEPVSETLTTPNSQYSDQASILKLQELVGSTWVDISDSDAAGKNPIRVSQVDDTRNYVVEVRKDWTEVYNTSSTKQNYFNKQLRLHIDAGKIVNIANNKQNGTIDIPLDGTIMRTPDVVSPEIGEVTPAEDTSGNVLDSYNVKLSEPVKLSDGTGGAGGANGEGLTPSQIQSANGSNSNNQGVPMPSAQFIRVDNGQTVEGIITSNVFVDAYDTTINIAPESALSAGKWRLVISSISDDYGNTASTVAHEIDVTQESVTTDFKIVWAAVSDQQTYAEDHIGVERGRYIFVKFSKPVTMTGNSVNAGVTGNYTVNGATLPTGTQIRANIVGYDDHDAVTDSVTIMLPTGNVNAGWGATGDYTV
SGKNAMLNVSRAITATTGENLSNGGLIRIPFQYGSATEDTGYNDYNDSLTALTDAVWGNYRSETRAGYDNLRDYYKALKSALENDKYRRVVLTAPLDLSNPDDNPNEDQKDAVAVFGRSHTLTIKRAVDFDLNGNNITGNVVISTTDAVNRIKLHSSKERAHIYGYANNKDNVATLTVNAGSAKEFLLDNVEVHETDKGNALNINDTWKASFVNNGVIDGKIRITDTNGCGFKNENTTDGFTNRTRFIIDSTGDVNLKGDLSALRNLTDEFGITVNQAAKLSFGVDSKDETTPCDISGVKIVVRGPGARVIFTPVATTTADTALTAEADNVRVQLSQANSGSGKIQFFTDRGGKIVAVDKDNKEVTSDSKDAVKISSDDIKVTGIQKALENLDVQTGVITDGKVDSTVTISCGAISGGSYNIEELAKNIKKAEFEYKGKPDTTGIVANYSLLSTNLLKKDSTHIWPKDNWTDQKDDVSDTIRVTLAYDGYTMVKYIKVTRV</TSeq_sequence>
</TSeq>
</TSeqSet>
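To illustrate that the extraction step from the question works the same way on either response, here is a sketch that runs the grep/cut pipeline over the `TSeq_taxid` line of each record shown above. The lines are supplied inline rather than fetched with curl, and the `extract_taxid` function name is an assumption made for illustration.

```shell
# Print the text between <TSeq_taxid> and </TSeq_taxid> from XML on stdin.
extract_taxid() {
    grep TSeq_taxid | cut -d '>' -f 2 | cut -d '<' -f 1 | tr -d '\n'
}

old_taxid=$(printf '%s\n' '<TSeq_taxid>1423</TSeq_taxid>' | extract_taxid)
new_taxid=$(printf '%s\n' '<TSeq_taxid>97253</TSeq_taxid>' | extract_taxid)

# Same tab-separated layout as the loop in the question.
printf 'X53307\t%s\n' "$old_taxid"
printf 'WP_004064712.1\t%s\n' "$new_taxid"
```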
Answer 4
Here is a refactoring that avoids reading the entire CSV file into memory and slightly simplifies the post-processing.
# Use lower case for private variables
# and https://mywiki.wooledge.org/DontReadLinesWithFor
while read -r acc; do
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${acc}&rettype=fasta&retmode=xml" |
# Run a single awk script for extraction and formatting
awk -v acc="$acc" '/TSeq_taxid/ {
sub(/<\/.*/, ""); sub(/.*>/, ""); print acc "\t" $0 }'
done <csvfile
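The loop can be tried without network access by swapping the curl call for a stub. In the sketch below, `fetch_xml` and `lookup_all` are hypothetical names, and the stub returns only the `TSeq_taxid` line of the two records shown in the previous answer. The two `sub` calls are written so that the closing tag is stripped first and everything up to the remaining `>` second, leaving just the value between the tags.

```shell
# Hypothetical stand-in for the curl call: prints a canned one-line record.
fetch_xml() {
    case $1 in
        X53307)         echo '<TSeq_taxid>1423</TSeq_taxid>' ;;
        WP_004064712.1) echo '<TSeq_taxid>97253</TSeq_taxid>' ;;
    esac
}

# Read accessions from stdin, one per line, and print "accession<TAB>taxid".
lookup_all() {
    while read -r acc; do
        fetch_xml "$acc" |
        awk -v acc="$acc" '/TSeq_taxid/ {
            sub(/<\/.*/, ""); sub(/.*>/, ""); print acc "\t" $0 }'
    done
}

result=$(printf '%s\n' X53307 WP_004064712.1 | lookup_all)
printf '%s\n' "$result"
```

Reading with `while read -r` handles one line at a time, so the list can be arbitrarily long, and values are never subject to word splitting or globbing.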